Abstract


Virtually all methods of learning dynamic models from data start from the same
basic assumption: that the learning algorithm will be provided with one or more
sequences of data generated from the dynamic model. However, in quite a few
modern time series modeling tasks, the collection of reliable time series data turns
out to be a major challenge, due to either slow progression of the dynamic process of
interest, or inaccessibility of repetitive measurements of the same dynamic process
over time. In most of those situations, however, we observe that it is easier to collect
a large amount of non-sequence samples, or random snapshots of the dynamic
process of interest without time information.

This thesis aims to exploit such non-sequence data in learning a few widely used
dynamic models, including fully observable, linear and nonlinear models as well as
Hidden Markov Models (HMMs). For fully observable models, we point out several
issues on model identifiability when learning from non-sequence data, and develop
EM-type learning algorithms based on maximizing approximate likelihood. We also
consider the setting where a small amount of sequence data are available in addition
to non-sequence data, and propose a novel penalized least-squares approach that uses
non-sequence data to regularize the model. For HMMs, we draw inspiration from
recent advances in spectral learning of latent variable models and propose spectral
algorithms that provably recover the model parameters, under reasonable assumptions
on the generative process of non-sequence data and the true model. To the
best of our knowledge, this is the first formal guarantee on learning dynamic models
from non-sequence data. We also consider the case where little sequence data
are available, and propose learning algorithms that, as in the fully observable case,
use non-sequence data to provide regularization, but do so in combination with
spectral methods. Experiments on synthetic data and several real data sets, including
gene expression and cell image time series, demonstrate the effectiveness of our
proposed methods.

In the last part of the thesis we return to the usual setting of learning from
sequence data, and consider learning bi-clustered vector auto-regressive models,
whose transition matrix is both sparse, revealing significant interactions among variables,
and bi-clustered, identifying groups of variables that have similar interactions
with other variables. Such structures may aid other learning tasks in the same domain
that have abundant non-sequence data by providing better regularization in our
proposed non-sequence methods.
Chapter 1
Introduction 


Learning  dynamic  models  from  data  is  the  traditional  topic  of  system  identification  [Ljung,  1999] 
in  control  theory  and  many  algorithms  have  been  proposed.  In  the  machine  learning  literature,  the 
learning  of  temporal  graphical  models,  such  as  dynamic  Bayesian  networks  [Ghahramani,  1998a; 
Murphy,  2002],  and  the  learning  of  various  types  of  Markov  models  [e.g.,  Abbeel  and  Ng,  2005; 
Beal  et  al.,  2002;  Ghahramani,  1998b;  Hsu  et  al.,  2009;  Rabiner,  1989;  Song  et  al.,  2010],  have 
been  extensively  studied. 

Virtually all methods of learning dynamic models from data start from the same basic assumption:
that the learning algorithm will be provided with one or more sequences of
data generated from the dynamic model. However, in quite a few modern dynamic modelling
tasks, a major difficulty turns out to be the collection of reliable time series data. In some of these
tasks, such as learning dynamic models of galaxy or star evolution, the dynamics of the processes
of interest are far too slow for researchers to collect successive data points showing any meaningful
changes. At more modest time scales, the same problem arises in the understanding of
slow-evolving human diseases such as Alzheimer’s or Parkinson’s, which may progress over a
decade or more. In other situations, the dynamic process of interest cannot be measured
repeatedly, so researchers have to measure multiple instances of the same process
while maintaining synchronization among these instances. One such example is gene expression
time series. In their study, Tu et al. [2005] measured expression profiles of yeast genes along
consecutive metabolic cycles. Due to the destructive nature of the measurement technique, they
collected expression data from multiple yeast cells. In order to obtain reliable time series data,
they spent a lot of effort developing a stable environment to synchronize the cells during the
metabolic cycles. Yet, they point out in their discussion that such a synchronization scheme may
not work for other species, e.g., certain bacteria and fungi, as effectively as for yeast. Another
example is cell image time series. In a recent study [Buck et al., 2009] on cell cycle dependence
of protein subcellular location inferred from images, the authors discussed some challenges in
obtaining time series of cell images: “... time-lapse images can be more difficult to obtain than
single images of cells because many microscopes do not maintain a viable environment for the
cells they image (e.g., cells die after some time, and even while alive they are not under constant
conditions). Furthermore, repeated excitation of dyes for fluorescence imaging causes
photobleaching, reducing signal and leading to toxic chemical changes (phototoxicity), further
perturbing cells.”
Table 1.1: Summary of thesis work

Model classes (columns): First-order Observable; Hidden Markov Model
Data assumptions (rows): seq. + non-seq.; fully non-sequence

Data assumption: seq. + non-seq. (both model classes) [Chapters 4 and 6]
• Non-sequence data as regularization
• Significant improvement over standard sequence-only methods when sequence data is few

Data assumption: fully non-sequence; model class: First-order Observable [Chapter 3]
• EM-type algorithms maximizing approximate likelihood
• Synthetic data, gene expressions and cell images

Data assumption: fully non-sequence; model class: Hidden Markov Model [Chapter 5]
• Spectral algorithms with formal guarantee
• First theoretical statement on learning from non-sequence data

Learning Bi-clustered Vector Auto-regressive Model [Chapter 7]

While obtaining reliable time series can be difficult, it is often easier to collect non-sequence
samples, or snapshots of the dynamic process of interest. For example, the Sloan Digital Sky
Survey (SDSS)¹ has collected images of millions of celestial objects, each of which may be in a
different phase of its life cycle. In medical sciences, a scientist studying Alzheimer’s or Parkinson’s
can collect samples from his or her current pool of patients, each of whom may be in a
different stage of the disease. In gene expression analysis, current technology already enables
large-scale collection of static gene expression data. The same holds in cell image analysis,
as concluded by Buck et al. [2009]: “A method using unsynchronized cells with single-image
capture would have the advantages of avoiding repeated exposure to fluorescence excitation
(permitting higher-energy exposure to obtain better signal) and fewer environment viability
requirements.”

More broadly, in social and medical sciences it is usually the case that a longitudinal study,
the collection and analysis of data from the same subjects over long periods of time, is more
powerful but also more expensive than a cross-sectional study, which uses observations collected from a
large or representative portion of the population within a short time frame. With recent advances
in sensing technology, there will likely be a large increase in cross-sectional data in various
domains, and it would be valuable if such data could be used not only in cross-sectional studies but also to
aid longitudinal studies.


¹ http://www.sdss.org/


1.1 Thesis Summary

Motivated by challenges in time series data collection for a variety of modern dynamic modeling
tasks, we propose and study several methods for learning various dynamic models using
non-sequence data that lack time information but are easy to obtain. Table 1.1 summarizes our thesis
work and contributions. In brief, we consider learning two classes of dynamic models: first-order
observable models and hidden Markov models (HMMs), under two conditions on the input data.
When the input data consist of both sequence and non-sequence samples, our proposed methods
use non-sequence data as regularization for existing sequence-only learning methods, and achieve
significant improvement when sequence data are scarce. In the more challenging situation where all
the input data are non-sequence, our methods for learning first-order observable models maximize
approximate likelihood functions via EM-type procedures, and obtain encouraging results
on synthetic data as well as several real data sets, including gene expression data and cell images.
For HMMs, we take advantage of recent advances in spectral learning [Anandkumar et al.,
2012a] and identify reasonable generative assumptions on non-sequence data that lead to spectral
methods with consistent parameter learning guarantees. To the best of our knowledge, this is the
first theoretical statement on learning from non-sequence data.


1.2  Thesis  Overview 

After surveying related work in Chapter 2, we first consider in Chapters 3 and 4 learning fully
observable dynamic models. In Chapter 3, we assume the only data available are snapshots
taken from multiple instantiations of a dynamic process at unknown times, and the dynamic process
falls in the class of fully observable, discrete-time, first-order linear or non-linear dynamic
models. Acknowledging several issues in model identifiability, we develop EM-type learning
algorithms that maximize approximate likelihood functions, along with novel initialization
methods based on the idea of temporal smoothing. In a number of experiments on synthetic and
real data sets including gene expression data and cell images, the proposed algorithms are able to
learn moderately to highly accurate dynamic models, but at times suffer severely from the model
ambiguity inherent in this setting.

In Chapter 4 we therefore consider slightly stronger assumptions: in addition to non-sequence
data, a small amount of sequence data are also available. We restrict the class of dynamic models
to first-order discrete-time stable vector auto-regressive (VAR) models, and assume the
non-sequence data are independent samples drawn from the stationary distribution of the VAR model.
The latter assumption is valid when, for example, snapshots are taken from multiple trajectories
of a VAR process after they have reached stationarity. Based on these assumptions, we propose
learning algorithms that minimize a new penalized least-squares objective, which incorporates
non-sequence data in a novel regularization term that quantifies violation of the Lyapunov equation
relating the autoregressive model to the covariance of its stationary distribution. Experiments
demonstrate that when the amount of sequence data is small, our proposed method of exploiting
non-sequence data can significantly improve over standard learning algorithms, which use only
the sequence data.
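The role of the Lyapunov equation can be illustrated with a minimal sketch. It is not the algorithm of Chapter 4: the squared Frobenius norm of the residual, the known noise covariance `noise_cov`, the weight `lam`, and all function names are assumptions made for illustration. For a stable VAR model with transition matrix A and noise covariance S, the stationary covariance Q satisfies Q = A Q Aᵀ + S, so a sample covariance computed from non-sequence data can penalize candidate matrices A that violate this relation.

```python
import numpy as np


def stationary_covariance(A, noise_cov):
    """Solve Q = A Q A^T + noise_cov for a stable A by vectorization."""
    p = A.shape[0]
    # vec(A Q A^T) = (A kron A) vec(Q), hence (I - A kron A) vec(Q) = vec(noise_cov).
    lhs = np.eye(p * p) - np.kron(A, A)
    q = np.linalg.solve(lhs, noise_cov.reshape(-1))
    return q.reshape(p, p)


def lyapunov_violation(A, noise_cov, Q_hat):
    """Squared Frobenius norm of the Lyapunov residual (an assumed penalty form)."""
    residual = Q_hat - A @ Q_hat @ A.T - noise_cov
    return float(np.sum(residual ** 2))


def penalized_objective(A, X_prev, X_next, Q_hat, noise_cov, lam):
    """Least-squares fit on sequence pairs plus the Lyapunov penalty.

    X_prev and X_next hold consecutive states as rows; Q_hat is a covariance
    estimated from non-sequence samples; lam is a hypothetical trade-off weight.
    """
    fit = float(np.sum((X_next - X_prev @ A.T) ** 2))
    return fit + lam * lyapunov_violation(A, noise_cov, Q_hat)
```

With the exact stationary covariance in place of `Q_hat`, the penalty vanishes at the true transition matrix and is positive for a matrix such as `0.5 * A`, which is what lets the non-sequence data constrain the estimate.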

Although fully observable models like VAR are useful, in many applications only a subset
of the variables in the underlying dynamical system can be observed. Thus in Chapters 5 and
6 we turn to learning dynamic models with hidden states. At first glance this seems formidable
because even when sequence data are available, learning hidden-state models is in general difficult
both statistically and computationally. However, an emerging line of research in machine
learning, known as spectral learning, has recently developed statistically consistent and computationally
efficient algorithms for learning, from sequence data, perhaps the most widely used class
of hidden-state models: hidden Markov models (HMMs) [Anandkumar et al., 2012b; Hsu et al.,
2009; Siddiqi et al., 2010; Song et al., 2010]. Unlike traditional EM-based learning methods,
which are vulnerable to bad local optima, these new methods are based on spectral decomposition,
such as Singular Value Decomposition (SVD), of empirical moments computed from data,
and therefore result in unique, local-minima-free estimates of model parameters, allowing formal
statistical guarantees to be established. Building on these recent advances, we propose spectral
algorithms for learning HMMs that exploit non-sequence data.
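A small sketch can show the low-rank structure in observable moments that spectral methods rely on. It implements none of the cited algorithms. It assumes a discrete HMM with a column-stochastic transition matrix `T`, an observation matrix `O` whose columns are emission distributions, and an initial state distribution `pi`; these names and the toy interface are illustrative. Under these assumptions the joint distribution of two consecutive observations is O T diag(pi) Oᵀ, whose rank is at most the number of hidden states, and an SVD of its empirical estimate exposes that structure.

```python
import numpy as np


def pair_moment(O, T, pi):
    """Population matrix P[i, j] = Pr(x2 = i, x1 = j) for a discrete HMM."""
    return O @ T @ np.diag(pi) @ O.T


def empirical_pair_moment(pairs, num_symbols):
    """Estimate the same matrix from observed (x1, x2) pairs by counting."""
    P = np.zeros((num_symbols, num_symbols))
    for x1, x2 in pairs:
        P[x2, x1] += 1.0
    return P / len(pairs)


def numerical_rank(M, tol=1e-10):
    """Count singular values above tol relative to the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol * s[0]))
```

For a model with two hidden states and four observation symbols, `numerical_rank(pair_moment(O, T, pi))` is at most 2 even though the matrix is 4 by 4.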

In Chapter 5 we consider the case where only non-sequence data are available. However,
unlike in Chapter 3 where all the data points are assumed to have the same initial condition, here
we need multiple sets of non-sequence data, each generated from a different initial hidden-state
distribution. The main contribution of this chapter is to identify conditions on the initial
hidden-state distributions, by drawing connections to spectral learning of Latent Dirichlet Allocation
(LDA) models [Anandkumar et al., 2013], as well as distributional assumptions on the missing
time information that allow us to develop spectral algorithms with formal guarantees on HMM
parameter learning. To the best of our knowledge, these are the first theoretical guarantees for
learning from non-sequence data. Compared with EM-based methods in simulation, our spectral
algorithms perform significantly better in parameter estimation.

Then in Chapter 6 we look at the situation where, as in Chapter 4, some sequence data are
available and the non-sequence data consist of independent samples from the stationary distribution
of the underlying HMM. Extending state-of-the-art spectral algorithms for learning observable
representations of HMMs [Hsu et al., 2009; Siddiqi et al., 2010; Song et al., 2010], our
proposed methods obtain improved estimates of lower-order moments by minimizing estimation
error on the sequence data plus a regularization term on the non-sequence data, and then apply
spectral decomposition to the improved moment estimates. Interestingly, although the high-level
idea is similar to that of Chapter 4 and HMMs are more complex models than VARs, the optimization
problems in this chapter turn out to be convex whereas the ones in Chapter 4 are
non-convex. Experiments on simulated data and sensor recordings of human activities demonstrate
improvement over existing sequence-only spectral algorithms.

In the final part of the thesis, Chapter 7, we return to the traditional setting of learning from
sequence data and focus on learning structured vector auto-regressive models. Although this
chapter is not directly related to the main theme of the thesis, the methodology developed here
can aid learning in the non-sequence setting through its estimated structure of the VAR model,
which may guide the design of the regularization terms in the proposed EM-type methods (Chapter 3)
when applied to non-sequence data in the same domain. We are motivated by problems
in biological time series analysis, where the dependency graph and the clustering of variables, such as
expression levels of genes, are two of the most commonly sought structures. In spite of being
closely related, these two structures are usually estimated in separate procedures. We thus
propose a fully Bayesian approach to simultaneous learning of these two structures for vector
auto-regressive models, using a novel bi-clustered and sparsity-promoting prior for the transition
matrix and an efficient blocked Gibbs sampling procedure for posterior inference. Applied to a
T-cell activation gene expression time series data set [Rangel et al., 2004], this new method finds
a more biologically meaningful clustering of genes than state-of-the-art gene expression time
series clustering methods.

This  thesis  contains  our  published  work  in  several  venues: 

•  Chapter  3  [Huang  and  Schneider,  2009;  Huang  et  al.,  2010] 

•  Chapter  4  [Huang  and  Schneider,  2011] 

•  Chapter  5  [Huang  and  Schneider,  2013b] 

• Chapter 6 [Huang and Schneider, 2013a]

•  Chapter  7  [Huang  and  Schneider,  2012] 
Chapter 2
Related  Work 


In a good number of applications, a critical issue is to understand the dynamics or temporal dependency
underlying observed data that lack temporal or sequential information. As a result,
various methods were proposed independently in different areas, but to the best of our knowledge,
no prior work studies the general problem of learning dynamic models from non-sequence
data as comprehensively as this thesis. In this chapter we survey several such applications and
briefly explain the methods developed therein.

As mentioned in Chapter 1, cell imaging has become a useful tool for studying certain
types of cell dynamics, such as variation in protein subcellular localization during the cell cycle
[Buck et al., 2009]. Instead of relying on time-series cell images as in most previous studies,
Buck et al. [2009] propose to utilize static, asynchronous snapshots taken from multiple cells at
various phases of the cell cycle because, as quoted in Chapter 1, such images are easier to obtain
on a large scale than time-series images. Their approach is to first extract a one-dimensional
surrogate of cell cycle time from static cell image features by manifold learning techniques¹, and
then use this surrogate in place of cell cycle time for subsequent cell-cycle dependence tests.
Through analysis of real data, they confirm that such a surrogate is well correlated with the cell
cycle. However, they did not perform explicit dynamic modeling, i.e., building models to predict
future observations.

¹ Manifold learning techniques have been used in dynamic model learning to identify a subspace where the
dynamics reside, leading to more accurate models. See, for example, [Boot and Gordon, 2011] and references
therein. Similar techniques can be used in combination with our proposed methods as a pre-processing step to make
the problem lower-dimensional and thus easier.

A closely related problem studied in a number of disciplines is that of ordering a set of objects.
Depending on the domain of interest, an ordering can be interpreted as progression of
time, some coherent sequential structure or monotonic property. In natural language processing,
the task of multi-document summarization requires ordering of sentences selected from different
documents, and automatic title generation techniques construct a headline by selecting and
ordering words from the input text [Barzilay and Elhadad, 2002; Deshpande et al., 2007]. In
multimedia analysis and retrieval, automatic generation of videos or slideshows from photos involves
laying down a coherent and smoothly transitioning sequence of scenes [Chen et al., 2006;
Hua et al., 2004]. Some of the techniques developed for these tasks are tailored to a specific
problem domain, and most of them have access to some external knowledge about orderings
of objects, such as time stamps of photos or grammatical rules for sentence compositions. In
contrast, we consider a more general problem setting which relies on little or no domain-specific
knowledge, though our proposed methods make more explicit model assumptions.

The computational biology community has also studied the problem of ordering objects, in
the context of finding a temporal ordering of static, asynchronous microarray measurement data
[Gupta and Bar-Joseph, 2008; Magwene et al., 2003]. The proposed methods therein are less
domain dependent and fall in a large family of algorithms for solving the curve reconstruction
problem, which has been studied in various fields such as computational geometry (e.g., Giesen
[1999]), statistics [Hastie and Stuetzle, 1989], and machine learning [Smola et al., 2001]. More
specifically, Magwene et al. [2003] proposed to reconstruct the temporal ordering of microarray
samples through finding the minimum spanning tree on the graph formed by the sample points,
while Gupta and Bar-Joseph [2008] proposed to solve an instance of the traveling salesman problem
(TSP) and proved that under certain conditions on the dynamics generating the samples, the
optimal TSP path accurately reconstructs the true ordering. A key assumption behind these two
methods is that temporally close sample points should also be spatially close. Both of these
methods are unable to choose an overall direction of time, a limitation due to the invariance to
time direction in their objective functions. Our problem setting differs from the aforementioned
in that we consider snapshots from multiple trajectories of some dynamic process rather than
out-of-order samples from a single sequence. Moreover, we focus more on learning a model for the
underlying dynamics than on ordering the data points. Although the non-sequence data considered
in our settings, as formalized in later chapters, can be ordered based on their unobserved time
stamps, such an ordering may not be very useful to existing dynamic model learning methods
because these methods require as input sequences tracking the same instances over time. Nevertheless,
ordering objects is still a useful component in our proposed methods in Chapter 3, but
the objects being ordered, instead of raw data points, are some representative points discovered
by clustering algorithms.
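The assumption that temporally close points are spatially close can be illustrated with a simplified sketch in the spirit of the minimum-spanning-tree idea. It is not the procedure of Magwene et al. [2003], nor the method of Chapter 3: it builds a Euclidean minimum spanning tree with Prim's algorithm and returns the longest path in that tree as a candidate ordering, and the function names are hypothetical. Points off that path are left unordered, and, as noted above, the direction of time cannot be determined.

```python
import numpy as np


def mst_edges(X):
    """Prim's algorithm on the complete Euclidean graph over the rows of X."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = D[0].copy()               # cheapest known link from each node to the tree
    parent = np.zeros(n, dtype=int)  # tree node that realizes that link
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[j]), j))
        in_tree[j] = True
        closer = D[j] < best
        parent[closer] = j
        best = np.minimum(best, D[j])
    return edges, D


def farthest_path(adj, D, start):
    """Walk the tree from start; return the path to the farthest node."""
    dist, prev, stack = {start: 0.0}, {start: None}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + D[u, v]
                prev[v] = u
                stack.append(v)
    node = max(dist, key=dist.get)
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def mst_backbone_order(X):
    """Indices of the points on the longest MST path, in path order."""
    X = np.asarray(X, dtype=float)
    edges, D = mst_edges(X)
    adj = {i: [] for i in range(len(X))}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    end = farthest_path(adj, D, 0)[-1]   # first sweep finds one end of the longest path
    return farthest_path(adj, D, end)    # second sweep traces the path itself
```

For points sampled along a smooth, non-self-intersecting curve and then shuffled, the returned path recovers the original order or its reverse.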

Another problem involving learning dynamic models without temporal ordering is the network
structure inference problem considered by Rabbat et al. [2008]. The authors point out that
in many situations, ranging from telecommunication network tomography problems to construction
of biological signal pathways or social networks, the goal is to reconstruct a directed graph
representing the underlying network structure, but the only available data are sets of nodes
co-occurring in random walks on the graph without the order in which they were visited. These
problems can be cast as learning a first-order Markov chain from data lacking ordering information.
To avoid the exponential-time complexity of enumerating all possible orderings, the authors
propose a polynomial-time, importance sampling based EM algorithm with convergence guarantee
to estimate the parameters of the Markov chain. Inspired by Rabbat et al. [2008], several
researchers in computational linguistics study the problem of learning a bi-gram language model
from the commonly used, order-invariant bag-of-words representation of text corpora [Zhu et al.,
2008], and develop a similar sampling-based EM algorithm. While empirically successful to
some extent, these algorithms, like most EM procedures, do not have guarantees on the quality
of their parameter estimates. Very recently, Gripon and Rabbat [2013] proposed a combinatorial
algorithm for graph reconstruction from co-occurrence data and provided some theoretical guarantees
on the reconstruction accuracy. However, their results apply only to undirected graphs
and require the input to the algorithm to be the exact set of triples of nodes that are connected but
cycle-free in the graph. In Chapter 5 we also study the problem of learning first-order Markov
chains from data lacking temporal information. However, instead of data with hidden orderings,
we consider data drawn from multiple, independent trajectories of the underlying Markov
chain, so there was no ordering to begin with. At first glance, learning in this setting may seem
more difficult than in the hidden-ordering setting, but as detailed in Chapter 5, the independence
assumption in our setting actually makes learning easier.
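The cost of enumerating orderings in the hidden-ordering setting can be seen in a brute-force sketch. It is not the sampling-based EM algorithm of Rabbat et al. [2008]. It assumes a row-stochastic transition matrix `P`, an initial distribution `pi`, and a walk that visits each node in the observed set exactly once; the function names are hypothetical. The likelihood of an unordered set then sums path probabilities over all permutations, so a set of m nodes requires m! terms.

```python
from itertools import permutations

import numpy as np


def path_probability(order, pi, P):
    """Probability of visiting the nodes in exactly this order."""
    prob = pi[order[0]]
    for a, b in zip(order, order[1:]):
        prob *= P[a, b]
    return float(prob)


def unordered_likelihood(nodes, pi, P):
    """Sum path probabilities over every ordering of the observed node set."""
    return sum(path_probability(o, pi, P) for o in permutations(nodes))


def ordering_posterior(nodes, pi, P):
    """Posterior over orderings given the unordered set (brute force)."""
    probs = {o: path_probability(o, pi, P) for o in permutations(nodes)}
    total = sum(probs.values())
    return {o: p / total for o, p in probs.items()}
```

An EM procedure would need such posterior weights in its E-step, which is why exact enumeration becomes impractical as the sets grow.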

In addition to the above general problem areas, there are two specific problems we find relevant
to our work. One is collective inference on Markov models [Sheldon et al., 2008], which
finds the most likely collection of paths on a trellis graph given observations on the collective behavior
of a group of dynamic objects. Their motivation was to trace out trajectories of individual
birds from aggregate statistics of an entire species of migrating birds. The other is connecting the
dots between news articles [Shahaf and Guestrin, 2010], which aims to build a chronological and
coherent story line of news that connects a given pair of starting and end articles, thereby providing
readers a detailed description of the causal relationship between two events. A common
feature in both problems is the need to identify structures of sequentially matched objects
from partially ordered data. A similar situation arises in one component of our methods, where
the data points are put into ordered clusters for further processing (Section 3.3). But instead of
finding hard matchings between data points in adjacent clusters, we take a soft-matching type of
approach, alternately updating the soft matching and the dynamic model.

While our focus is on learning from data lacking time or ordering information, another common
problem involving time in dynamic modeling is the misalignment of time measurements
across multiple sequences of observed data, due to internal variation of the dynamic process of
interest or measurement error. This problem arises in many time series modeling tasks, such as
speech recognition [L. Rabiner, 1993; Vintsyuk, 1968], analysis of gene expression time series
[Aach and Church, 2001], activity recognition [Junejo et al., 2011], and audio information retrieval
[Chapter 4, Muller, 2007], bringing forth a large body of research, known in statistics as
curve registration [Ramsay and Li, 1998; Silverman, 1995] and in computer science as dynamic
time warping [Berndt and Clifford, 1994; Keogh and Ratanamahatana, 2005]. The general idea
in these works is to first postulate a class of possible time transformations or warping operations,
and then recover the most likely warping operation for each observation by optimizing some
global matching score across all the data sequences. The final result is time-warped sequences
of observations that are in better alignment with one another. While not directly related to our
thesis focus, these methods can potentially aid our work in, for example, an iterative, EM-like
manner, where time stamps and dynamic models are alternately re-estimated given the other
until convergence.
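For readers unfamiliar with dynamic time warping, the following is a minimal textbook-style sketch for two scalar sequences, not the method of any particular cited work. The absolute-difference local cost and the function name `dtw` are choices made here for illustration. The dynamic program fills a table of cumulative alignment costs and then backtracks to recover which time indices are matched.

```python
import numpy as np


def dtw(a, b):
    """Return (total cost, alignment path) between scalar sequences a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = best cumulative cost of aligning a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(cost[i - 1, j - 1],  # advance both sequences
                                     cost[i - 1, j],      # advance a only
                                     cost[i, j - 1])      # advance b only
    # Backtrack from the end to recover the matched index pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(cost[n, m]), path[::-1]
```

For example, `dtw([0, 1, 2, 3], [0, 1, 1, 2, 3])` has zero cost because the repeated value in the second sequence is absorbed by the warping.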

Finally, we briefly mention where our work lies in the vast space of research on dynamical
systems conducted in physics and mathematics. Most dynamical theories are concerned with
the asymptotic behavior of some dynamical system, under various assumptions on the phase or
state space of the system and the short-time evolution law [Katok and Hasselblatt, 1996]. Our
work studies, in some sense, the reverse problem: given observations that reflect the
global status of a dynamical system, we try to develop methods that infer the short-time or
local evolution law.
Chapter 3
Learning Fully Observable Models From Non-sequence Data


In this chapter, we are interested in learning first-order, discrete-time, fully observable linear
dynamic models described by the following transition function:

$$x^{(t+1)} = A x^{(t)} + \epsilon^{(t+1)}, \tag{3.1}$$

where $x^{(t)} \in \mathbb{R}^{p \times 1}$ is the state vector at time $t$, $A \in \mathbb{R}^{p \times p}$ is the state transition matrix, and $\epsilon^{(t)}$
is the noise vector at time $t$. Such a model is also known as a first-order vector auto-regressive
model (VAR) in the time series literature. For simplicity, we assume hereafter that $\forall t$, $\epsilon^{(t)} \sim
\mathcal{N}(\cdot \mid 0, \sigma^2 I)$, a Gaussian distribution with zero mean and covariance $\sigma^2 I$, where $I$ is the identity
matrix. However, the proposed methods in later sections can all be extended to handle general
covariance matrices. The dynamical system also has a start state, which we denote as $x^{(0)}$. Thus,
the linear dynamic models we consider are fully characterized by $\Theta = \{A, \sigma^2, x^{(0)}\}$.

When sequenced observations are available, a basic learning method is least-squares linear regression of the observations at time $t$ on the observations at time $t - 1$, whose properties have been studied extensively (see, e.g., [Hamilton, 1994]). The problem without observed state sequences is much more difficult. We assume that $n$ executions of the dynamic model (3.1) have taken place, and from each execution we have observed a single data point drawn at random from the sequence of states generated in that execution. The result is $n$ data points, $\{x_1, \ldots, x_n\}$, each from a different trajectory and having occurred at an unknown point in time. To avoid confusion in indices, hereafter we use a parenthesized superscript, e.g., $x^{(t)}$, to denote the time index, and a subscript, e.g., $x_i$, to denote the data index. A precise description of this generative process is given in Algorithm 3.1 along with a graphical illustration.
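The generative process of Algorithm 3.1 can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `sample_nonsequence` is hypothetical, and the time stamps are assumed to be uniform on $\{1, \ldots, T_{\max}\}$, which the algorithm itself does not require.

```python
import numpy as np

def sample_nonsequence(A, sigma2, x0, t_max, n, rng):
    """Draw n unordered snapshots, one per simulated trajectory (Algorithm 3.1)."""
    p = x0.shape[0]
    X = np.empty((n, p))
    # Assumption: time stamps are uniform on {1, ..., t_max}.
    times = rng.integers(1, t_max + 1, size=n)
    for i, t_i in enumerate(times):
        x = x0.astype(float)          # every trajectory restarts from x^(0)
        for _ in range(t_i):          # roll the model forward t_i steps
            x = A @ x + np.sqrt(sigma2) * rng.standard_normal(p)
        X[i] = x                      # keep only the state at time t_i
    return X, times

# Toy usage: a noise-free planar rotation, so every sample lies on the unit circle.
theta = 2 * np.pi / 20
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X, hidden_times = sample_nonsequence(A, 0.0, np.array([1.0, 0.0]), 20, 50,
                                     np.random.default_rng(0))
```

A learner in the non-sequence setting would see only the rows of `X`; `hidden_times` is returned here purely so the sketch can be checked.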

Algorithm 3.1 Sampling from multiple trajectories

Input: transition matrix $A$, $\sigma^2$, $x^{(0)}$, $T_{\max}$, and $n$
for $i = 1$ to $n$ do
    Pick a random time stamp $t_i$ from $\{1, \ldots, T_{\max}\}$.
    for $t = 1$ to $t_i$ do
        $x^{(t)} \leftarrow A x^{(t-1)} + \epsilon^{(t)}$, $\epsilon^{(t)} \sim \mathcal{N}(\cdot \mid 0, \sigma^2 I)$.
    end for
    Set $x_i = x^{(t_i)}$.
end for
Output: A sample $x_1, x_2, \ldots, x_n$.

[Graphical illustration: sample points $x_i$, each taken at its own time $t_i$ from a separate trajectory.]

We focus on estimating $A$ and $\sigma^2$, and treat the start state $x^{(0)}$ as a nuisance parameter. For an observation $x_i$, if its immediate predecessor $\tilde{x}_i$ is known, then the likelihood is simply $\mathcal{N}(x_i \mid A\tilde{x}_i, \sigma^2 I)$. But $\tilde{x}_i$ is unknown, so we integrate it out with respect to the distribution one time step earlier than $x_i$ and obtain the following likelihood:

$$L(x_i \mid \Theta, t_i) = \int \frac{\exp\left(-\frac{\|x_i - Ax\|_2^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{p/2}} \, \mathcal{N}(x \mid \mu^{(t_i-1)}, \Sigma^{(t_i-1)}) \, dx, \tag{3.2}$$

where $t_i$ denotes the true but unknown time of $x_i$, $\|\cdot\|_2$ is the vector two-norm, and the predecessor distribution, by the closure of Gaussians under linear transformation, is Gaussian with mean $\mu^{(t_i-1)}$ and covariance $\Sigma^{(t_i-1)}$, where

$$\mu^{(t)} := A^t x^{(0)}, \qquad \Sigma^{(t)} := \begin{cases} \sigma^2 \sum_{j=0}^{t-1} A^j (A^j)^\top, & t \geq 1, \\ 0, & t = 0. \end{cases} \tag{3.3}$$


Since the $n$ data points are drawn independently, we can factorize the likelihood of the sample points as

$$L(x_1, \ldots, x_n \mid \Theta, t_1, \ldots, t_n) = \prod_{i=1}^{n} L(x_i \mid \Theta, t_i). \tag{3.4}$$

The maximization of (3.4) is a challenging task because, as suggested by (3.3), the transition matrix $A$ appears in (3.4) as polynomials whose degrees depend on the missing time indices $t_i$'s. In the following sections we propose methods that avoid this difficulty through various approximations to (3.4). Before presenting them, we first discuss several possibly non-identifiable properties of the model when the true temporal information is missing.
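As a side calculation (not part of the original derivation), the polynomial dependence on $A$ can be made explicit by evaluating the integral in (3.2) with the standard identity for the marginal of a linear-Gaussian model. The only assumptions are the model (3.1) with isotropic noise and a fixed start state $x^{(0)}$:

```latex
\begin{aligned}
L(x_i \mid \Theta, t_i)
  &= \int \mathcal{N}(x_i \mid A x, \sigma^2 I)\,
          \mathcal{N}(x \mid \mu^{(t_i-1)}, \Sigma^{(t_i-1)})\, dx \\
  &= \mathcal{N}\!\left(x_i \,\middle|\, A\mu^{(t_i-1)},\;
          A\Sigma^{(t_i-1)}A^\top + \sigma^2 I\right) \\
  &= \mathcal{N}\!\left(x_i \,\middle|\, A^{t_i} x^{(0)},\;
          \sigma^2 \sum_{j=0}^{t_i-1} A^{j} (A^{j})^\top\right).
\end{aligned}
```

The mean is of degree $t_i$ in $A$ and the covariance contains powers of $A$ up to degree $2(t_i - 1)$, so with $t_i$ unknown even the functional form of the objective in $A$ is unknown.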


3.1  Identifiability  Issues 


Consider a simple linear dynamic model with the following transition matrix and initial point:

$$A = \begin{bmatrix} \cos(2\pi/T) & -\sin(2\pi/T) \\ \sin(2\pi/T) & \cos(2\pi/T) \end{bmatrix}, \qquad x^{(0)} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.$$

The ideal trajectory rolled out by this simple dynamic model lies on the unit circle in the two-dimensional Euclidean space. Suppose we observe a set of points from the ideal trajectory, but do not know their time indices. It is easy to see that all of the following dynamic models:

$$A^{(\tau)} = \begin{bmatrix} \cos(2\pi\tau/T) & -\sin(2\pi\tau/T) \\ \sin(2\pi\tau/T) & \cos(2\pi\tau/T) \end{bmatrix}, \qquad \tau \in \{\pm 1, \pm 2, \ldots, \pm(T-1)\},$$

as illustrated in Table 3.1, would explain the data equally well under any reasonable measure of goodness of fit. In the presence of process noise, some of these models may become less likely, but it would still be hard to uniquely determine the true dynamic model.

Table 3.1: An example demonstrating unidentifiability of time direction and speed.
[Columns: Data | Three equally possible dynamic models]

Table 3.2: An example demonstrating general unidentifiability.
[Columns: Data | Model 1 | Model 2]

The above example suggests two possibly non-identifiable properties of the model: the overall direction in time and the speed of the underlying dynamics. In fact, Peters et al. [2009] showed that under some linear dynamic models the true direction in time is not identifiable. The methods proposed in subsequent sections thus do not resolve these ambiguities; the learnt model may follow either of the two directions in time, but usually corresponds to the slowest dynamics.
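The direction and speed ambiguity in the rotation example can be checked numerically. The sketch below uses hypothetical helper names and picks rotation multiples $\tau$ that are coprime with $T$, so that a single noise-free trajectory visits every observed point; it then confirms that each alternative model produces exactly the same unordered point set as the original one.

```python
import numpy as np

def rotation(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def trajectory(A, x0, steps):
    """Noise-free states x0, A x0, ..., A^(steps-1) x0, one per row."""
    xs = [x0]
    for _ in range(steps - 1):
        xs.append(A @ xs[-1])
    return np.array(xs)

def same_point_set(P, Q, tol=1e-9):
    """True if every row of P has a match in Q and vice versa (order ignored)."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    return bool(d.min(axis=1).max() < tol and d.min(axis=0).max() < tol)

T = 12
x0 = np.array([1.0, 0.0])
reference = trajectory(rotation(2 * np.pi / T), x0, T)
# tau = -1 reverses time; tau = 5, -5, 7 rotate faster in either direction.
matches = {tau: same_point_set(trajectory(rotation(2 * np.pi * tau / T), x0, T),
                               reference)
           for tau in (-1, 5, -5, 7)}
```

Every entry of `matches` is `True`: once time stamps are discarded, nothing in the point set distinguishes these models.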

Another  perhaps  more  intriguing  example  is  depicted  in  Table  3.2,  which  presents  a  non- 
sequenced  and  noiseless  data  set  in  the  left  column  and  two  possible  dynamic  models  in  the  right 
column.  On  the  one  hand,  according  to  our  assumption  of  a  single  fixed  start  state  as  in  Algorithm 
3.1,  Model  1  should  be  favored  over  Model  2  under  any  reasonable  measure  of  goodness  of  fit 
that  incorporates  such  an  assumption.  On  the  other  hand,  under  a  certain  level  of  noise  and/or 
some  non-uniform  sampling  rate  in  the  temporal  domain,  the  data  generated  from  Model  1  may 
be  more  similar  to  a  cylinder  than  to  a  spiral,  making  Model  2  equally  or  even  more  likely  to 
have  generated  the  data.  There  are  more  examples  of  this  type,  such  as  a  torus  of  points  where 
rotations  around  the  short  and  the  long  circumferences  can  hardly  be  distinguished  from  each 
other  in  the  absence  of  any  temporal  information.  Theoretical  investigation  into  such  issues  as 
conditions  under  which  these  ambiguities  can  or  cannot  be  resolved  is  thus  an  important,  but 
challenging future direction.

3.2 Approximate Likelihood and Expectation Maximization

We present three methods for estimating $A$ and $\sigma^2$ in Sections 3.2.1 to 3.2.3, based on maximizing various approximations to the likelihood (3.4). The optimization is carried out by Expectation Maximization (EM) type algorithms. Section 3.2.4 then extends these three methods to learning nonlinear dynamic models by making use of reproducing kernels.


3.2.1  Unordered  Approximation 


We first remove the problem of unknown time indices by marginalizing out the missing $t_i$'s. According to the generative process in Algorithm 3.1, the $t_i$'s are independent of $A$ and $\sigma^2$, and also mutually independent. Let $P(t_i)$ denote the probability mass function of $t_i \in \{1, \ldots, T_{\max}\}$. We then write

$$\begin{aligned}
L(x_1, \ldots, x_n \mid \Theta)
&:= \sum_{t_1=1}^{T_{\max}} \cdots \sum_{t_n=1}^{T_{\max}} L(x_1, \ldots, x_n, t_1, \ldots, t_n \mid \Theta) \\
&= \sum_{t_1=1}^{T_{\max}} \cdots \sum_{t_n=1}^{T_{\max}} \left( \prod_{i=1}^{n} L(x_i \mid \Theta, t_i) P(t_i) \right) \\
&= \prod_{i=1}^{n} \sum_{t_i=1}^{T_{\max}} L(x_i \mid \Theta, t_i) P(t_i).
\end{aligned}$$

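Plugging in the conditional likelihood (3.2), we obtain

$$\begin{aligned}
L(x_1, \ldots, x_n \mid \Theta)
&= \prod_{i=1}^{n} \sum_{t_i=1}^{T_{\max}} \left( \int \frac{\exp\left(-\frac{\|x_i - Ax\|_2^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{p/2}} \, \mathcal{N}(x \mid \mu^{(t_i-1)}, \Sigma^{(t_i-1)}) \, dx \right) P(t_i) \\
&= \prod_{i=1}^{n} \int \frac{\exp\left(-\frac{\|x_i - Ax\|_2^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{p/2}} \left( \sum_{t_i=1}^{T_{\max}} \mathcal{N}(x \mid \mu^{(t_i-1)}, \Sigma^{(t_i-1)}) P(t_i) \right) dx.
\end{aligned} \tag{3.5}$$

In the case of $P(t_i) = 1/T_{\max}$, i.e., the $t_i$'s are uniformly distributed, and $T_{\max}$ is large, we have

$$\sum_{t_i=1}^{T_{\max}} \mathcal{N}(x \mid \mu^{(t_i-1)}, \Sigma^{(t_i-1)}) P(t_i) = \frac{1}{T_{\max}} \sum_{t_i=1}^{T_{\max}} \mathcal{N}(x \mid \mu^{(t_i-1)}, \Sigma^{(t_i-1)}) \approx \frac{1}{T_{\max}} \sum_{t_i=1}^{T_{\max}} \mathcal{N}(x \mid \mu^{(t_i)}, \Sigma^{(t_i)}), \tag{3.6}$$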


which is the density that the data points $\{x_1, \ldots, x_n\}$ are generated from. This gives another view of the generative process: a random sample point is drawn (approximately) from (3.6) and an observation is created by applying (3.1) to it. However, (3.6) still depends on the unknown $t_i$'s. To remove this dependency, we replace (3.6) with its empirical estimate given by the sample points we have. This together with (3.5) leads to the following approximate likelihood:

$$\hat{L}(x_1, \ldots, x_n \mid \Theta) := \prod_{i=1}^{n} \sum_{j \neq i} \frac{\exp\left(-\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2}\right)}{(n-1)(2\pi\sigma^2)^{p/2}}. \tag{3.7}$$


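We exclude the case that $x_i$ generates itself to avoid the degenerate estimate $A = I$. To avoid overfitting, we impose a zero-mean Gaussian prior on $A$ with precision $\lambda I$ and an inverse Gamma prior on $\sigma^2$ with shape and scale parameters $\alpha$ and $\beta$, leading to the approximate log-posterior:

$$\log P_{UM}(\Theta \mid x_1, \ldots, x_n) := \sum_{i=1}^{n} \log \sum_{j \neq i} \frac{\exp\left(-\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2}\right)}{(n-1)(2\pi\sigma^2)^{p/2}} - \frac{\lambda}{2}\|A\|_F^2 - (\alpha + 1)\log\sigma^2 - \frac{\beta}{\sigma^2}. \tag{3.8}$$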

This  is  the  representative  form  of  the  objective  our  first  method  aims  to  maximize.  Later  in  the 
experiments  we  may  use  variants  of  (3.8)  such  as  allowing  more  general  noise  variances,  but  all 
the  associated  methods  can  be  easily  derived  from  the  methods  based  on  (3.8),  which  we  present 
below.  Since  there  is  no  notion  of  ordering  involved  in  (3.8),  we  refer  to  it  as  the  Unordered 
Approximation. 

Before introducing our optimization algorithm, we point out that (3.8) considers the data points as if each $x_i$ were generated from some other $x_j$ in the sample by (3.1). However, according to Algorithm 3.1 no $x_i$ was generated from any other $x_j$ in the sample. Such a discrepancy is due to our replacing (3.6) with its empirical estimate, and an immediate consequence is that $\sigma^2$ in (3.8) now accounts for not only the noise $\epsilon$ in the dynamic model (3.1), but also the approximation error introduced by replacing (3.6) with the empirical density.

To present our learning algorithm, we observe that the likelihood (3.7) is a product of summations of Gaussian densities. This structure is also shared by the likelihood of Gaussian Mixture Models (GMM), for which Expectation Maximization (EM) algorithms are the common choice for estimation. Although (3.8) is not a GMM, its similar structure allows us to derive an EM procedure with analytical update rules. We first introduce a latent variable matrix $Z \in \{0, 1\}^{n \times n}$ such that

$$Z_{ij} = \begin{cases} 1, & x_i \text{ was generated from } x_j, \\ 0, & \text{otherwise,} \end{cases} \qquad Z_{ii} = 0, \quad \sum_{j=1}^{n} Z_{ij} = 1. \tag{3.9}$$


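Again, here "$x_i$ was generated from $x_j$" is to be taken as an approximation due to our replacing (3.6) with the data. We then rewrite (3.8) using the standard variational equation (cf. Section 9.4, [Bishop, 2006]):

$$\log P_{UM}(\Theta \mid x_1, \ldots, x_n) = \log \sum_{Z} P_{UM}(\Theta, Z \mid x_1, \ldots, x_n), \tag{3.10}$$

where

$$P_{UM}(\Theta, Z \mid x_1, \ldots, x_n) := \prod_{i=1}^{n} \prod_{j=1}^{n} \left( \frac{\exp\left(-\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2}\right)}{(n-1)(2\pi\sigma^2)^{p/2}} \right)^{Z_{ij}} \frac{\exp\left(-\frac{\lambda}{2}\|A\|_F^2 - \frac{\beta}{\sigma^2}\right)}{\sigma^{2(\alpha+1)}}$$

is referred to as the complete posterior. Following the standard EM derivation, in the E-step we compute the posterior probability of $Z$:

$$Q(Z \mid \Theta, x_1, \ldots, x_n) := \frac{P_{UM}(\Theta, Z \mid x_1, \ldots, x_n)}{P_{UM}(\Theta \mid x_1, \ldots, x_n)},$$

which simplifies as

$$\tilde{Z}_{ij} := Q(Z_{ij} = 1 \mid \Theta, x_1, \ldots, x_n) = \begin{cases} \dfrac{\exp\left(-\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2}\right)}{\sum_{s \neq i} \exp\left(-\frac{\|x_i - Ax_s\|_2^2}{2\sigma^2}\right)}, & j \neq i, \\ 0, & j = i. \end{cases} \tag{3.11}$$

In the M-step we maximize the expectation of the log complete posterior with respect to the posterior probability $Q(Z \mid \Theta, x_1, \ldots, x_n)$, which is held fixed at the current parameter estimates:

$$\max_{A, \sigma^2} \sum_{Z} Q(Z \mid \Theta, x_1, \ldots, x_n) \log P_{UM}(\Theta, Z \mid x_1, \ldots, x_n)$$
$$\Leftrightarrow \; \max_{A, \sigma^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{Z}_{ij} \left( -\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2} - \frac{p}{2}\log(2\pi\sigma^2) \right) - \frac{\lambda}{2}\|A\|_F^2 - (\alpha + 1)\log\sigma^2 - \frac{\beta}{\sigma^2}. \tag{3.12}$$

Setting the derivatives with respect to $A$ and $\sigma^2$ to zero in turn gives the update rules

$$A = \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{Z}_{ij} x_i x_j^\top \right) \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{Z}_{ij} x_j x_j^\top + \lambda\sigma^2 I \right)^{-1}, \tag{3.13}$$

$$\sigma^2 = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{Z}_{ij} \|x_i - Ax_j\|_2^2 + 2\beta}{pn + 2(\alpha + 1)}. \tag{3.14}$$

Algorithm 3.2 Expectation Maximization for (3.8)

Input: Data points $x_1, \ldots, x_n$
Initialize $A_{(0)}$ and $\sigma^2_{(0)}$, set $k = 0$
repeat
    Update $\tilde{Z}_{(k+1)}$ by (3.11) with $A_{(k)}$ and $\sigma^2_{(k)}$
    Update $A_{(k+1)}$ by (3.13) with $\tilde{Z}_{(k+1)}$ and $\sigma^2_{(k)}$
    Update $\sigma^2_{(k+1)}$ by (3.14) with $A_{(k+1)}$ and $\tilde{Z}_{(k+1)}$
    $k \leftarrow k + 1$
until the approximate log posterior (3.8) does not increase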


A  summary  of  the  EM  procedure  is  given  in  Algorithm  3.2.  Note  that  (3.14)  can  be  easily 
generalized  to  handle  general  covariance  structures. 
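The EM procedure of Algorithm 3.2 can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the hyper-parameter defaults and the toy data (noisy points near the unit circle, not drawn by Algorithm 3.1) are all assumptions, and the update formulas are the closed-form maximizers of the expected complete log posterior under the Gaussian and inverse Gamma priors described above.

```python
import numpy as np

def sq_residuals(X, A):
    """R[i, j] = ||x_i - A x_j||^2 for the rows x_i of X (shape (n, p))."""
    pred = X @ A.T
    return ((X[:, None, :] - pred[None, :, :]) ** 2).sum(axis=2)

def log_posterior(X, A, sigma2, lam, alpha, beta):
    """Approximate log posterior (3.8), evaluated with a stable log-sum-exp."""
    n, p = X.shape
    logw = -sq_residuals(X, A) / (2 * sigma2)
    np.fill_diagonal(logw, -np.inf)              # x_i may not generate itself
    m = logw.max(axis=1)
    lse = m + np.log(np.exp(logw - m[:, None]).sum(axis=1))
    const = n * (np.log(n - 1) + 0.5 * p * np.log(2 * np.pi * sigma2))
    return (lse.sum() - const - 0.5 * lam * (A ** 2).sum()
            - (alpha + 1) * np.log(sigma2) - beta / sigma2)

def em_unordered(X, A, sigma2, lam=1e-2, alpha=1.0, beta=1e-2, iters=50):
    n, p = X.shape
    trace = [log_posterior(X, A, sigma2, lam, alpha, beta)]
    for _ in range(iters):
        # E-step: row-wise softmax of -R / (2 sigma^2) with a zero diagonal.
        logw = -sq_residuals(X, A) / (2 * sigma2)
        np.fill_diagonal(logw, -np.inf)
        Z = np.exp(logw - logw.max(axis=1, keepdims=True))
        Z /= Z.sum(axis=1, keepdims=True)
        # M-step for A: weighted ridge regression of x_i on x_j.
        S_ij = X.T @ Z @ X                          # sum_ij Z_ij x_i x_j^T
        S_jj = X.T @ (Z.sum(axis=0)[:, None] * X)   # sum_ij Z_ij x_j x_j^T
        A = S_ij @ np.linalg.inv(S_jj + lam * sigma2 * np.eye(p))
        # M-step for sigma^2, using the freshly updated A.
        sigma2 = ((Z * sq_residuals(X, A)).sum() + 2 * beta) / (p * n + 2 * (alpha + 1))
        trace.append(log_posterior(X, A, sigma2, lam, alpha, beta))
    return A, sigma2, Z, trace

# Toy data: noisy points near the unit circle with no time information.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=60)
X = np.c_[np.cos(angles), np.sin(angles)] + 0.05 * rng.standard_normal((60, 2))
A_hat, s2_hat, Z_hat, trace = em_unordered(
    X, np.eye(2) + 0.1 * rng.standard_normal((2, 2)), 0.5)
```

Because each M-step update maximizes the expected complete log posterior in one block of parameters, `trace` is non-decreasing. The sketch makes no attempt to detect the degenerate solutions discussed next.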

According to (3.11) and (3.12), we can view Algorithm 3.2 as a version of the iteratively re-weighted least squares (IRLS) procedure. Although it is simple and often computationally efficient, one may worry that without enforcing any directional consistency in the $\tilde{Z}_{ij}$'s the EM algorithm may lead to degenerate dynamic models, especially when the sample size is limited. In fact, this happens in our experiments on simulated data. We thus propose a variant of (3.8) that incorporates additional constraints in the next section.

Figure 3.1: Degenerate estimate by the unordered approximation on 200 data points. Panels: (a) True dynamics; (b) Degenerate estimate.

Although our focus is learning first-order models, it is worth noting that the proposed EM algorithm can be easily generalized to learn higher-order models. For example, consider the second-order model

$$x^{(t+2)} = A x^{(t+1)} + B x^{(t)} + \epsilon^{(t+2)}, \qquad \epsilon^{(t+2)} \sim \mathcal{N}(\cdot \mid 0, \sigma^2 I).$$


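Using approximations similar to those in (3.7), we may obtain the following unordered approximate likelihood:

$$\prod_{i=1}^{n} \sum_{j \neq i} \sum_{k \neq i,\, k \neq j} \frac{\exp\left(-\frac{\|x_i - Ax_j - Bx_k\|_2^2}{2\sigma^2}\right)}{(n-1)(n-2)(2\pi\sigma^2)^{p/2}}.$$

The corresponding EM algorithm then involves a latent three-way tensor variable $Z \in \{0, 1\}^{n \times n \times n}$: the E-step computes posterior probabilities that $x_i$ is generated from $x_j$ and $x_k$ for all triples $i \neq j \neq k$, and the M-step solves a weighted least squares regression for $A$, $B$ and $\sigma^2$.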

3.2.2  Partially-ordered  Approximation 

As mentioned before, the replacement of the true state space density with the empirical density results in the approximate likelihood (3.7), where the data points are treated as if each one were actually generated from some other one. What might be more problematic is that such an approximation ignores the fact that there is a latent temporal ordering induced by the unknown time indices of the data points, even though the data points are drawn independently. A possible consequence of ignoring the latent ordering is a degenerate estimate of the dynamic model, as illustrated in Figure 3.1, which shows the true one-step displacement vectors $Ax_i - x_i$ and the UM estimates. On this data set, the UM likelihood of the true model is even worse than that of the degenerate estimate.

In our experiments on synthetic data (Section 3.4), we find that such degeneracy issues usually arise when the data set is small, say a few hundred points. Thus in our second approach, we take into account the ordering relation explicitly. To do so, we introduce a set of weight parameters $\omega_{ij}$, collectively denoted as an $n$-by-$n$ matrix $\omega$, and modify the approximate likelihood (3.7) as follows:

$$\hat{L}_2(x_1, \ldots, x_n \mid \Theta, \omega) := \prod_{\substack{i=1 \\ i \notin S}}^{n} \sum_{j=1}^{n} \omega_{ij} \frac{\exp\left(-\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{p/2}}, \tag{3.15}$$

in which $S := \{i : t_i \leq t_j \ \forall j\}$,

$$\begin{cases} \omega_{ij} \geq 0, & t_j < t_i, \\ \omega_{ij} = 0, & t_j \geq t_i, \end{cases} \quad \text{and} \quad \omega_{ii} = 0 \ \ \forall i, \qquad \sum_{j=1}^{n} \omega_{ij} = 1 \ \ \forall i \notin S. \tag{3.16}$$

The first set of constraints in (3.16) is to force the summation in (3.15) to be consistent with a global direction of time, while the normalization constraints are to maintain the notion of approximating the true state space density with an empirical density. The set $S$ denotes the data points that are the earliest in time (hence cannot be generated from other data points), which can be viewed as rough estimates of the first state. If the underlying dynamic model exhibits a periodic behavior (such as rotation on a plane), the true first state may not be identifiable but $A$ and $\sigma^2$ still may be. In that case, $S$ is chosen arbitrarily and the relative time offsets between points may still be correct, thus leading to reasonable estimates of $A$ and $\sigma^2$.

As mentioned before, the true time indices $t_i$'s of the data points are missing, and even with the above approximation (3.15) and (3.16) it is still unclear how to jointly estimate them and the model parameters. We instead consider the $\omega_{ij}$'s as unknown parameters to be estimated, which we interpret as decomposing the global ordering information into parameters of pairwise relations. Again, we make clear that as in Section 3.2.1, here we are also approximating the likelihood as if each point in the data were actually generated from some other point in the data. The set of constraints (3.16) can be restated only in terms of $\omega$ as follows:

1. $\omega$ is non-negative; each row of $\omega$ sums to one or zero.

2. As a weighted adjacency matrix, $\omega$ represents a directed acyclic graph.

Note that for both constraints to hold at the same time, one or more rows in $\omega$ must sum to zero, and the corresponding data points form the set $S$. However, it is hard to maximize (3.15) with respect to $\omega$ under these constraints because they define a non-convex set. Moreover, Nicholson [1975] proved that a weighted adjacency matrix $M$ contains no cycle if and only if $\mathrm{permanent}(M + I) = 1$, and Valiant [1979] showed that computing the matrix permanent is #P-complete. We therefore consider a subset of the previous two constraints:

1. $\omega$ can only take values in $\{0, 1\}$.

2. As an adjacency matrix, $\omega$ forms a directed spanning tree.

The new constraints turn the problem into a combinatorial one, which at first glance seems even more difficult. As we will show below, the fact that this discrete version is computationally tractable depends entirely on our restricting $\omega$ to be a directed spanning tree. Under the new constraints, the set $S$ has only one data point, which is the root of the directed spanning tree.
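Algorithm 3.3 Alternating Maximization for (3.17)

Input: Data points $x_1, \ldots, x_n$.
Initialize $A_{(0)}$ and $\sigma^2_{(0)}$, set $k = 0$
repeat
    Construct the weight matrix $W_{(k)}$ by (3.24) with $A_{(k)}$ and $\sigma^2_{(k)}$
    $\omega_{(k+1)} \leftarrow \mathrm{OptimumBranch}(W_{(k)})$
    Update $A_{(k+1)}$ by (3.22) with $\omega_{(k+1)}$ and $\sigma^2_{(k)}$
    Update $\sigma^2_{(k+1)}$ by (3.23) with $A_{(k+1)}$ and $\omega_{(k+1)}$
    $k \leftarrow k + 1$
until the approximate log posterior (3.17) does not increase

Combining these tree-based constraints with the approximate likelihood (3.15) and imposing the same priors on $A$ and $\sigma^2$ as before, we propose maximizing the following approximate log posterior for estimation:

$$\max_{\substack{A, \sigma^2, \omega, \\ r \in \{1, \ldots, n\}}} \ \sum_{\substack{i=1 \\ i \neq r}}^{n} \log \sum_{j=1}^{n} \omega_{ij} \frac{\exp\left(-\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{p/2}} - \frac{\lambda}{2}\|A\|_F^2 - (\alpha + 1)\log\sigma^2 - \frac{\beta}{\sigma^2} \tag{3.17}$$

$$\text{s.t.} \quad \omega_{ij} \in \{0, 1\}, \tag{3.18}$$

$$\sum_{j=1}^{n} \omega_{ij} = 1 \ \ \forall i \neq r, \qquad \sum_{j=1}^{n} \omega_{rj} = 0, \tag{3.19}$$

$$\omega \text{ forms a tree with root } x_r. \tag{3.20}$$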

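We refer to (3.17) as the Partially-ordered Approximation, and with the constraints (3.18) and (3.19) it can be simplified as follows:

$$\begin{aligned}
&\sum_{\substack{i=1 \\ i \neq r}}^{n} \log \sum_{j=1}^{n} \omega_{ij} \frac{\exp\left(-\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{p/2}} - \frac{\lambda}{2}\|A\|_F^2 - (\alpha + 1)\log\sigma^2 - \frac{\beta}{\sigma^2} \\
&= \sum_{i=1}^{n} \log \prod_{j=1}^{n} \left( \frac{\exp\left(-\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{p/2}} \right)^{\omega_{ij}} - \frac{\lambda}{2}\|A\|_F^2 - (\alpha + 1)\log\sigma^2 - \frac{\beta}{\sigma^2} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \omega_{ij} \left( -\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2} - \frac{p}{2}\log(2\pi\sigma^2) \right) - \frac{\lambda}{2}\|A\|_F^2 - (\alpha + 1)\log\sigma^2 - \frac{\beta}{\sigma^2}.
\end{aligned} \tag{3.21}$$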

Interestingly, this objective function is in the same form as the expected log complete posterior (3.12) in Section 3.2.1 with $\tilde{Z}_{ij}$ being replaced by $\omega_{ij}$. One may thus view $\omega$ as the latent variable $Z$ in Section 3.2.1 with additional constraints. However, there is a subtle difference: $Z$ serves only as a means to derive the EM algorithm and does not appear in the maximization objective (3.7) of the Unordered Approximation, whereas $\omega$ explicitly appears in the optimization problem (3.17) as an unknown parameter to be estimated.

Next we discuss how to maximize (3.17). Since (3.21) has the same form as (3.12), the optimal $A$ and $\sigma^2$ under a fixed $\omega$ have the same form as (3.13) and (3.14):

$$A = \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \omega_{ij} x_i x_j^\top \right) \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \omega_{ij} x_j x_j^\top + \lambda\sigma^2 I \right)^{-1}, \tag{3.22}$$

$$\sigma^2 = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \omega_{ij} \|x_i - Ax_j\|_2^2 + 2\beta}{p(n-1) + 2(\alpha + 1)}. \tag{3.23}$$

When $A$ and $\sigma^2$ are fixed, maximizing (3.21) with respect to $\omega$ under (3.18), (3.19) and (3.20) is equivalent to finding the maximum spanning tree on a directed weighted graph, in which each data point $x_i$ is a node, each pair of nodes is connected in both directions, and the weight on the edge $(i, j)$ is

$$W_{ij} := -\left( \frac{\|x_i - Ax_j\|_2^2}{2\sigma^2} + \frac{p}{2}\log(2\pi\sigma^2) \right). \tag{3.24}$$

The problem of finding maximum spanning trees on directed graphs is a special case of the optimum branchings problem, which seeks a maximum or minimum forest of rooted trees (branching) on a directed graph. Chu and Liu [1965], Edmonds [1967], and Bock [1971] independently developed efficient algorithms for the optimum branchings problem. The ones by the former two are virtually identical, and are usually referred to as the Chu-Liu-Edmonds algorithm, for which Tarjan [1977] gave an efficient implementation that runs in $O(n^2)$ time on densely connected graphs, where $n$ is the number of nodes. Camerini et al. [1979] pointed out an error by Tarjan [1977] and provided a remedy retaining the same time complexity.

With these results, we present an alternating maximization procedure, Algorithm 3.3, for maximizing (3.17), where $\mathrm{OptimumBranch}(\cdot)$, taking an edge-weight matrix as the input argument, uses an implementation¹ of Tarjan [1977] and Camerini et al. [1979]. Since Algorithm 3.3 never decreases the objective (3.21), it converges to a local maximum.
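To make the tree-selection step concrete, the sketch below builds the edge-weight matrix from the Gaussian log-density of each point given each candidate predecessor and then finds the best directed spanning tree by exhaustive search. The exhaustive search is only a stand-in for the Chu-Liu-Edmonds algorithm and is feasible for a handful of points; all function names are hypothetical.

```python
import itertools
import numpy as np

def edge_weights(X, A, sigma2):
    """W[i, j]: Gaussian log-density of x_i given predecessor x_j."""
    n, p = X.shape
    resid = ((X[:, None, :] - (X @ A.T)[None, :, :]) ** 2).sum(axis=2)
    return -(resid / (2 * sigma2) + 0.5 * p * np.log(2 * np.pi * sigma2))

def reaches_root(parent, root):
    """True if every node reaches `root` by following parents (no cycles)."""
    n = len(parent)
    for node in range(n):
        steps = 0
        while node != root:
            node, steps = parent[node], steps + 1
            if steps > n:             # walked too far: we are stuck in a cycle
                return False
    return True

def best_tree_bruteforce(W):
    """Exhaustive maximum-weight spanning tree; parent[i] = j means x_j -> x_i."""
    n = W.shape[0]
    best_score, best_parent = -np.inf, None
    for root in range(n):
        others = [i for i in range(n) if i != root]
        options = [[j for j in range(n) if j != i] for i in others]
        for combo in itertools.product(*options):
            parent = [-1] * n         # -1 marks the root
            for i, j in zip(others, combo):
                parent[i] = j
            if not reaches_root(parent, root):
                continue
            score = sum(W[i, parent[i]] for i in others)
            if score > best_score:
                best_score, best_parent = score, parent
    return best_parent, best_score

# Noise-free chain x_{k+1} = A x_k: the best tree should be the chain itself.
A_true = 0.9 * np.array([[np.cos(0.5), -np.sin(0.5)],
                         [np.sin(0.5),  np.cos(0.5)]])
chain = [np.array([1.0, 0.0])]
for _ in range(4):
    chain.append(A_true @ chain[-1])
chain = np.array(chain)
parent, score = best_tree_bruteforce(edge_weights(chain, A_true, 0.1))
```

With the true transition matrix, every chain edge has zero residual, so the recovered `parent` list links each point to its actual predecessor. In the alternating procedure, the selected tree would then be passed to the closed-form updates of $A$ and $\sigma^2$.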

¹ Available at http://edmonds-alg.sourceforge.net/

3.2.3 Expectation Maximization over Directed Spanning Trees

Recently in the Natural Language Processing community, researchers [Globerson et al., 2007; Smith and Smith, 2007] have developed sum-product inference algorithms for directed spanning trees, which make use of the matrix tree theorem [Tutte, 1984]. Based on their techniques, we develop an EM procedure whose E-step computes the expectation of the latent variable $Z$ over all directed spanning trees. Such a tree-based EM procedure can be viewed as being midway between the previous two methods, the first of which averages out entirely the latent sequential nature of the data, while the second aggressively selects the single most likely partial order.

Consider the set of spanning trees, $\mathcal{T}(X)$, on the complete directed graph whose nodes are the sample points. We represent a spanning tree by its adjacency matrix, whose rows correspond to heads and columns correspond to tails. We also slightly abuse the notation $Z$ to mean both a predecessor matrix and the corresponding set of edges, so $(i, j) \in Z$ means $Z_{ij} = 1$. We then maximize the following approximate log posterior:

$$\log P_{tree}(A, \sigma^2 \mid X) := \log \sum_{Z \in \mathcal{T}(X)} \prod_{(i,j) \in Z} \exp\left(-\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2}\right) - \frac{\lambda}{2}\|A\|_F^2 - \left( \frac{p(n-1)}{2} + (\alpha + 1) \right)\log\sigma^2 - \frac{\beta}{\sigma^2}, \tag{3.25}$$

where as before we place a zero-mean Gaussian prior on $A$ with hyper-parameter $\lambda$ and an inverse Gamma prior on $\sigma^2$ with hyper-parameters $\alpha$ and $\beta$. A major difference between (3.25) and the unordered approximate log posterior (3.8) is that the former sums over "global" latent structures, i.e., spanning trees, whereas the latter sums over "local" latent predecessor variables as shown in (3.10). We thus expect (3.25) to be more robust against undesirable local maxima than (3.8).

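To derive an estimation procedure based on maximizing (3.25), we first denote the posterior distribution over $Z \in \mathcal{T}(X)$ by

$$Q(Z \mid A, \sigma^2, X) := \frac{\prod_{(i,j) \in Z} \exp\left(-\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2}\right)}{\sum_{Z' \in \mathcal{T}(X)} \prod_{(i,j) \in Z'} \exp\left(-\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2}\right)}. \tag{3.26}$$

Then, by applying the standard variational equation we obtain the following lower bound:

$$\begin{aligned}
\log P_{tree}(A, \sigma^2 \mid X)
&\geq \sum_{Z \in \mathcal{T}(X)} Q(Z \mid A', (\sigma')^2, X) \sum_{(i,j) \in Z} \left( -\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2} \right) - \frac{\lambda}{2}\|A\|_F^2 - \frac{\beta}{\sigma^2} - \left( \frac{p(n-1)}{2} + (\alpha + 1) \right)\log\sigma^2 \\
&= -\sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{Z}_{ij} \frac{\|x_i - Ax_j\|_2^2}{2\sigma^2} - \frac{\lambda}{2}\|A\|_F^2 - \frac{\beta}{\sigma^2} - \left( \frac{p(n-1)}{2} + (\alpha + 1) \right)\log\sigma^2,
\end{aligned} \tag{3.27}$$

where

$$\tilde{Z}_{ij} := \mathbb{E}_Q[Z_{ij}] = \sum_{Z \in \mathcal{T}(X)} \mathbb{1}\{(i,j) \in Z\} \, Q(Z \mid A', (\sigma')^2, X). \tag{3.28}$$

The lower bound (3.27) holds for all choices of $A'$ and $(\sigma')^2$ in the posterior mean (3.28), suggesting an EM procedure that alternates between computing $\tilde{Z}_{ij}$ and maximizing (3.27) with respect to $A$ and $\sigma^2$.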

For the M-step, the lower bound (3.27), as a function of $A$ and $\sigma^2$, is in the same form as the complete log posterior (3.12), leading to update rules similar to (3.13) and (3.14):

$$A = \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{Z}_{ij} x_i x_j^\top \right) \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{Z}_{ij} x_j x_j^\top + \lambda\sigma^2 I \right)^{-1}, \tag{3.29}$$

$$\sigma^2 = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{Z}_{ij} \|x_i - Ax_j\|_2^2 + 2\beta}{p(n-1) + 2(\alpha + 1)}. \tag{3.30}$$

Algorithm 3.4 Expectation Maximization for (3.25)

Input: Data points $x_1, \ldots, x_n$
Initialize $A_{(0)}$ and $\sigma^2_{(0)}$, set $k = 0$
repeat
    Update $\tilde{Z}_{(k+1)}$ by (3.33) with $A_{(k)}$ and $\sigma^2_{(k)}$
    Update $A_{(k+1)}$ by (3.29) with $\tilde{Z}_{(k+1)}$ and $\sigma^2_{(k)}$
    Update $\sigma^2_{(k+1)}$ by (3.30) with $A_{(k+1)}$ and $\tilde{Z}_{(k+1)}$
    $k \leftarrow k + 1$
until the approximate log posterior (3.25) does not increase

For the E-step, we resort to the techniques in Sections 3.1 and 3.2 of [Globerson et al., 2007]. Let

$$W_{ij} := \begin{cases} \exp\left(-\frac{\|x_i - Ax_j\|_2^2}{2\sigma^2}\right), & i \neq j, \\ 0, & i = j, \end{cases} \tag{3.31}$$

denote the weight on the edge $x_j$ to $x_i$. Based on the Laplacian of the corresponding weighted directed graph, we define the following matrix:

$$L_{ij} := \begin{cases} r_i, & j = 1, \\ \sum_{i'=1}^{n} W_{ii'}, & i = j, \ j > 1, \\ -W_{ij}, & i \neq j, \ j > 1, \end{cases} \tag{3.32}$$

which replaces the first column of the Laplacian with a non-negative root selection score vector $r \in \mathbb{R}^n$. The values in $r$ reflect how likely each sample point $x_i$ would be the root of a spanning tree. When prior knowledge is unavailable, we simply set $r_i = 1$, $i = 1, \ldots, n$. Then, we compute $\tilde{Z}_{ij}$ by

$$\tilde{Z}_{ij} = (1 - \mathbb{1}\{i = 1\}) W_{ij} (L^{-1})_{ii} - (1 - \mathbb{1}\{j = 1\}) W_{ij} (L^{-1})_{ji}. \tag{3.33}$$

We determine convergence of the algorithm by checking the value of the log posterior (3.25), which is computed by

$$\log P_{tree}(A, \sigma^2 \mid X) \propto \log |L| - \frac{\lambda}{2}\|A\|_F^2 - \left( \frac{p(n-1)}{2} + (\alpha + 1) \right)\log\sigma^2 - \frac{\beta}{\sigma^2}.$$

A summary of the EM algorithm is in Algorithm 3.4.
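The E-step over spanning trees reduces to one matrix inversion. The sketch below, with hypothetical function names, builds the modified Laplacian (the first column replaced by root scores), reads off the edge marginals from its inverse, and checks them against an explicit enumeration of all directed spanning trees on four nodes. It assumes `W[i, j]` is the non-negative weight of the edge from $x_j$ to $x_i$ with a zero diagonal.

```python
import itertools
import numpy as np

def tree_marginals(W, r=None):
    """Edge marginals over directed spanning trees via the matrix tree theorem."""
    n = W.shape[0]
    r = np.ones(n) if r is None else np.asarray(r, dtype=float)
    L = -np.array(W, dtype=float)            # off-diagonal entries: -W_ij
    L[np.diag_indices(n)] = W.sum(axis=1)    # diagonal: total incoming weight
    L[:, 0] = r                              # first column: root scores
    L_inv = np.linalg.inv(L)
    not_first = (np.arange(n) > 0).astype(float)
    Z = (not_first[:, None] * W * np.diag(L_inv)[:, None]
         - not_first[None, :] * W * L_inv.T)
    return Z, np.linalg.slogdet(L)[1]        # marginals, log partition function

def tree_marginals_bruteforce(W, r):
    """Same quantities by enumerating every rooted spanning tree (tiny n only)."""
    n = W.shape[0]
    Z, total = np.zeros((n, n)), 0.0
    for root in range(n):
        others = [i for i in range(n) if i != root]
        options = [[j for j in range(n) if j != i] for i in others]
        for combo in itertools.product(*options):
            parent = dict(zip(others, combo))
            is_tree = True
            for node in others:              # every node must reach the root
                steps = 0
                while node != root and steps <= n:
                    node, steps = parent[node], steps + 1
                is_tree = is_tree and node == root
            if not is_tree:
                continue
            weight = r[root] * np.prod([W[i, parent[i]] for i in others])
            total += weight
            for i in others:
                Z[i, parent[i]] += weight
    return Z / total, np.log(total)

rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, size=(4, 4))
np.fill_diagonal(W, 0.0)
r = rng.uniform(0.5, 1.5, size=4)
Z_fast, logZ_fast = tree_marginals(W, r)
Z_slow, logZ_slow = tree_marginals_bruteforce(W, r)
```

The two routes agree, and the marginals sum to $n - 1$ because every spanning tree has exactly $n - 1$ edges. The determinant-based route costs one $n \times n$ inversion, whereas the enumeration grows super-exponentially in $n$.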


3.2.4  Nonlinear  Extension  via  Kernel  Regression 

To learn nonlinear dynamic models, we extend the aforementioned three methods through the use of kernel regression. We consider nonlinear dynamic models of the following form:

$$x^{(t+1)} = B\phi(x^{(t)}) + \epsilon^{(t)}, \qquad \epsilon^{(t)} \sim \mathcal{N}(\cdot \mid 0, \sigma^2 I), \tag{3.34}$$

where $\phi(\cdot)$ maps a point in $\mathbb{R}^p$ into a Reproducing Kernel Hilbert Space (RKHS) endowed with a kernel function $\mathcal{K}(x, y) = \phi(x)^\top \phi(y)$, and $B$ is a linear mapping from the RKHS to $\mathbb{R}^p$. Replacing $Ax_j$ in (3.8), (3.17) and (3.25) by $B\phi(x_j)$ then leads to nonlinear extensions of the three approximate log posteriors.

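Next we extend Algorithms 3.2, 3.3 and 3.4 for learning $B$ and $\sigma^2$. For the E-steps, we only need to replace $Ax_j$ in (3.11), (3.24) and (3.31) by $B\phi(x_j)$. For the M-steps, we solve the weighted least squares problems (3.12), (3.21) and (3.27) with $Ax_j$ replaced by $B\phi(x_j)$. The resulting three update rules for $B$ and $\sigma^2$ are very similar, so for brevity here we only give the one for maximizing the unordered approximate posterior:

$$\begin{aligned}
B &= \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{Z}_{ij} x_i \phi(x_j)^\top \right) \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{Z}_{ij} \phi(x_j) \phi(x_j)^\top + \lambda\sigma^2 I \right)^{-1} \\
&= X\tilde{Z}\phi(X)^\top \left( \phi(X) \Lambda_{\tilde{Z}} \phi(X)^\top + \lambda\sigma^2 I \right)^{-1} && (3.35) \\
&= X\tilde{Z} \left( K\Lambda_{\tilde{Z}} + \lambda\sigma^2 I \right)^{-1} \phi(X)^\top, && (3.36)
\end{aligned}$$

$$\sigma^2 = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \tilde{Z}_{ij} \|x_i - B\phi(x_j)\|_2^2 + 2\beta}{pn + 2(\alpha + 1)}, \tag{3.37}$$

where $X := [x_1 \cdots x_n]$ collects the data into a $p$-by-$n$ matrix, $\phi(X) := [\phi(x_1) \cdots \phi(x_n)]$ is the RKHS mapping of the entire data set, $\Lambda_{\tilde{Z}}$ is a diagonal matrix with $(\Lambda_{\tilde{Z}})_{ii} := \sum_{j=1}^{n} \tilde{Z}_{ji}$, and $K := \phi(X)^\top \phi(X)$ is the kernel matrix. We obtain (3.36) from (3.35) by using the matrix inversion lemma.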

One issue with the above extensions is that we cannot compute $B$ when the mapping $\phi(\cdot)$ is of infinite dimension. However, we observe that the EM procedures only make use of $B\phi(X)$, and according to (3.36)

$$\begin{aligned}
B\phi(X) &= X\tilde{Z} \left( K\Lambda_{\tilde{Z}} + \lambda\sigma^2 I \right)^{-1} \phi(X)^\top \phi(X) \\
&= X\tilde{Z} \left( K\Lambda_{\tilde{Z}} + \lambda\sigma^2 I \right)^{-1} K.
\end{aligned}$$

Therefore, instead of $B$ we maintain and update a $p$-by-$n$ matrix $M := X\tilde{Z} \left( K\Lambda_{\tilde{Z}} + \lambda\sigma^2 I \right)^{-1}$ in the EM iterations. To predict the next state for a new observation $x$, we compute $M\phi(X)^\top \phi(x)$, which also only requires kernel evaluations. Alternatively, we may compute a finite-dimensional approximation to $\phi(X)$ by doing a low-rank factorization of the kernel matrix $K \approx \hat{\phi}(X)^\top \hat{\phi}(X)$, and replace $\phi(X)$ in the EM procedure with $\hat{\phi}(X) \in \mathbb{R}^{m \times n}$, $m < n$. This can be viewed as dimensionality reduction in the RKHS. Then we can maintain and update $B \in \mathbb{R}^{p \times m}$ explicitly. To do prediction on a set of new data points, we project them onto the basis found by factorizing the training kernel matrix, thereby computing their finite-dimensional approximation $\hat{\phi}$, and then apply the estimated $B$ to the mapped points.
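The equivalence between the explicit update for $B$ and the kernel-only matrix $M$ can be checked numerically when the feature map is finite-dimensional. The sketch below uses a hypothetical quadratic feature map (`quad_features`) and a random row-stochastic weight matrix in place of real E-step output; both are assumptions made purely for illustration.

```python
import numpy as np

def quad_features(X):
    """Hypothetical explicit feature map: rows [1, x, products x_a * x_b (a <= b)]."""
    n, p = X.shape
    a, b = np.triu_indices(p)
    return np.hstack([np.ones((n, 1)), X, X[:, a] * X[:, b]])

rng = np.random.default_rng(0)
n, p, lam, sigma2 = 12, 2, 0.1, 0.5
X_rows = rng.standard_normal((n, p))
X = X_rows.T                               # p-by-n, columns are data points
Phi = quad_features(X_rows).T              # d-by-n, columns are phi(x_j)
K = Phi.T @ Phi                            # n-by-n kernel matrix

# Stand-in for the E-step output: zero diagonal, rows summing to one.
Z = rng.uniform(size=(n, n))
np.fill_diagonal(Z, 0.0)
Z /= Z.sum(axis=1, keepdims=True)
Lam = np.diag(Z.sum(axis=0))               # (Lam)_jj = sum_i Z_ij

d, c = Phi.shape[0], lam * sigma2
# Explicit update: needs the d-by-d feature covariance.
B = X @ Z @ Phi.T @ np.linalg.inv(Phi @ Lam @ Phi.T + c * np.eye(d))
# Kernel-only update: needs only the n-by-n kernel matrix.
M = X @ Z @ np.linalg.inv(K @ Lam + c * np.eye(n))

# Predicting the successor of a new point both ways.
x_new = rng.standard_normal((1, p))
pred_primal = B @ quad_features(x_new).T
pred_dual = M @ (Phi.T @ quad_features(x_new).T)   # kernel values k(x_j, x_new)
```

Both the fitted values `B @ Phi` and the prediction for `x_new` coincide with their kernel-only counterparts, which is what allows the EM iterations to run without ever forming $B$.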


3.3  Initialization  of  EM  by  Temporal  Smoothing 

All of the proposed methods solve non-convex optimization problems, and avoiding local optima is a critical issue. A common practice in applying EM methods is to run the algorithm multiple times, each with a randomly initialized model, and then choose the best local optimum as the final estimate. We follow this practice in our experiments on simulated data in Section 3.4, but observe that the number of random restarts needed to obtain a good model is usually large, meaning that many random initializations lead to undesirable local optima. Moreover, our simulated data are low dimensional, and the problem caused by local optima will only become worse in higher dimensions, which are common with real data. We thus investigate an alternative way of initialization.

We  begin  by  observing  that  in  the  case  of  a  linear  dynamic  model,  the  samples  generated  by 
Algorithm  3.1  can  be  viewed  as  i.i.d.  samples  drawn  from  the  following  mixture  of  Gaussians: 

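$$x \sim \sum_{t=1}^{T_{\max}} \pi^{(t)}\,\mathcal{N}\big(\mu^{(t)}, \Sigma^{(t)}\big),$$  (3.38)

where $\mu^{(t)}$ and $\Sigma^{(t)}$ are defined in (3.3) and $\pi^{(t)} > 0$ is the probability that $x$ is drawn at time $t$.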
Based  on  this  view,  we  devise  a  heuristic  to  initialize  the  model: 

1. Estimate the $\mu^{(t)}$'s by fitting a GMM to or clustering the data

2. Estimate the true temporal order of the $\mu^{(t)}$'s based on their estimates from Step 1

3. Learn a dynamic model from the estimated sequence of $\mu^{(t)}$'s by existing dynamic model
learning methods

For  Step  1  we  can  use  the  standard  EM  algorithm  for  learning  GMMs,  or  simply  the  k-means 
algorithm  since  subsequent  steps  only  need  estimates  of  the  means.  Step  2  in  its  own  right  is  a 
challenging problem. If we believe temporally close $\mu^{(t)}$'s should be similar, we can compute
pairwise distances between estimates of the $\mu^{(t)}$'s and solve a traveling salesman problem (TSP).
Then  we  need  to  decide  the  direction  of  time  on  the  TSP  path,  which  is  often  impossible  without 
prior  or  expert  knowledge.  In  our  experiments  in  Sections  3.5  and  3.5.2  we  simply  try  both 
directions  and  report  the  one  that  performs  better.  In  high  dimensions,  Euclidean  distances  suffer 
from  the  curse  of  dimensionality  and  are  vulnerable  to  noise.  We  thus  propose  an  alternative  way 
to  recover  the  true  temporal  order,  which  is  based  on  the  idea  of  temporal  smoothing. 
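
As a rough illustration of Steps 2 and 3 with distance-based ordering, the sketch below assumes the cluster means have already been estimated (Step 1) and that there are only a handful of them. The brute-force path search is a stand-in for a real TSP solver, and the names `order_means_by_path` and `fit_linear_model` are hypothetical.

```python
import itertools
import numpy as np

def order_means_by_path(means):
    """Shortest open path through the rows of `means` (k x p), by brute force.

    Only sensible for small k; a TSP solver would replace this in practice.
    """
    k = len(means)
    D = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=2)
    best, best_len = None, np.inf
    for perm in itertools.permutations(range(k)):
        if perm[0] > perm[-1]:          # a path and its reverse have the same length
            continue
        length = sum(D[perm[i], perm[i + 1]] for i in range(k - 1))
        if length < best_len:
            best, best_len = perm, length
    return list(best)

def fit_linear_model(seq):
    """Least-squares A such that seq[t + 1] is approximately A @ seq[t]; seq is T x p."""
    past, future = seq[:-1], seq[1:]
    W, *_ = np.linalg.lstsq(past, future, rcond=None)  # solves past @ W = future
    return W.T
```

The path carries no direction of time, so a caller would fit one model on the returned order and one on its reverse, and keep whichever is preferred by prior knowledge or by predictive performance.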

Unlike  methods  proposed  in  previous  sections,  the  method  we  present  in  the  following  does 
not  make  any  assumptions  about  the  functional  form  of  the  underlying  dynamic  model.  It  only 
assumes the underlying dynamics to be smooth, i.e., the curvature of the trajectory rolled out
by the dynamic model is small. More precisely, we quantify smoothness by the second order
differences of temporally adjacent points generated by the dynamic model:

$$S = \sum_{t=2}^{T_{\max}-1} \big\|(x^{(t+1)} - x^{(t)}) - (x^{(t)} - x^{(t-1)})\big\|^2,$$  (3.39)

where $T_{\max}$ is the maximum time. Small values of $S$ correspond to smooth trajectories. An
example is in Figure 3.2. Such a smoothness measure has been used as the regularization term in the
Hodrick-Prescott filter [Hodrick and Prescott, 1997; Lesser, 1961], a common tool in
macroeconomics for obtaining a smooth and nonlinear representation of a time series.

Figure 3.2: An example of smooth (left) vs. non-smooth (right) trajectory

The quantity (3.39) cannot be computed on our data since the true time indices of the data
points are missing. Nevertheless, it can be succinctly expressed using the Laplacian $L$ of the
temporal adjacency graph obtained by connecting temporally adjacent pairs of data points. More
specifically, we let $X := [x_1 \cdots x_n]$ be the p-by-n data matrix as before, and $Z$ be a directed
temporal adjacency matrix such that $Z_{ij} = 1$ if $x_j$ precedes $x_i$ immediately in time, and 0
otherwise. Then, we define $\bar{Z} := Z + Z^\top$ to represent the undirected, symmetric temporal
adjacency of the data points. If the data points were sorted according to their true temporal order,
the matrix $\bar{Z}$ would consist of ones in the upper-first and lower-first off-diagonals and zeros
elsewhere. The graph Laplacian based on the adjacency matrix $\bar{Z}$ is then $L = \mathrm{diag}(\bar{Z}\mathbf{1}) - \bar{Z}$,
where $\mathbf{1}$ is a vector of ones and $\mathrm{diag}(\bar{Z}\mathbf{1})$ denotes the diagonal matrix with the vector $\bar{Z}\mathbf{1}$ in the
main diagonal. Simple algebraic manipulation shows that the smoothness $S$ can be expressed in
terms of $L$ (hence $Z$) as follows:

$$S(Z) = \|XL\|_F^2 = \mathrm{Tr}\big((\mathrm{diag}(\bar{Z}\mathbf{1}) - \bar{Z})^\top X^\top X\,(\mathrm{diag}(\bar{Z}\mathbf{1}) - \bar{Z})\big),$$  (3.40)

which is a quadratic and convex function of $\bar{Z}$ and hence $Z$. Since we assume the true dynamics
to  be  smooth,  a  natural  way  to  reconstruct  a  temporal  ordering  would  be  to  solve  the  following 
problem: 


$$Z^* = \arg\min_Z S(Z)$$

s.t.  Z  represents  a  directed  Hamiltonian  path  through  the  data  points.  (3.41) 
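
To make the objective concrete, the following sketch evaluates $S(Z)$ through the graph Laplacian and solves the ordering problem by exhaustive search on a tiny example. All function names are hypothetical, and the convention $Z_{ij} = 1$ when $x_j$ immediately precedes $x_i$ is taken from the definition above. In this sketch the Laplacian form equals the sum of interior second-order differences plus two first-difference terms at the endpoints of the path, which the accompanying check confirms numerically.

```python
import itertools
import numpy as np

def second_difference_sum(X):
    """Sum of squared second-order differences of the columns of X (p x T)."""
    d2 = X[:, 2:] - 2.0 * X[:, 1:-1] + X[:, :-2]
    return float((d2 ** 2).sum())

def path_adjacency(order, n):
    """Directed adjacency Z with Z[i, j] = 1 if point j immediately precedes point i."""
    Z = np.zeros((n, n))
    for prev, nxt in zip(order[:-1], order[1:]):
        Z[nxt, prev] = 1.0
    return Z

def S_of_Z(X, Z):
    """Squared Frobenius norm of X L, with L the Laplacian of Z + Z^T."""
    Zbar = Z + Z.T
    L = np.diag(Zbar.sum(axis=1)) - Zbar
    return float(np.linalg.norm(X @ L, "fro") ** 2)

def best_order_bruteforce(X):
    """Exhaustive search over Hamiltonian paths; feasible only for very small n."""
    n = X.shape[1]
    return min(itertools.permutations(range(n)),
               key=lambda o: S_of_Z(X, path_adjacency(o, n)))
```

The search visits all n! orderings, which is exactly why a cheaper heuristic is needed for realistic sample sizes.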


However,  this  problem  is  essentially  a  quadratic  version  of  TSP,  and  to  the  best  of  our  knowledge, 
no  efficient  solver  exists  for  such  problems.  We  thus  consider  the  following  two-step  heuristics. 
In  the  first  step,  we  minimize  S(Z)  under  a  modified  set  of  constraints: 


$$\hat{Z} = \arg\min_Z S(Z)$$

s.t. $Z\mathbf{1} = \mathbf{1},\ Z^\top\mathbf{1} = \mathbf{1},\ Z_{ij} \ge 0,\ Z_{ii} = 0.$  (3.42)


The constraints in (3.42) are not a proper relaxation of (3.41) because $Z$ must have one zero row
and one zero column to represent a Hamiltonian path. Nevertheless, we can interpret solving
(3.42) as learning a pairwise similarity $\hat{Z}$ whose (i, j)-th entry reflects how likely $x_i$ is to precede
$x_j$ temporally. Then in the second step, we solve an instance of TSP with $1 - (\hat{Z} + \hat{Z}^\top)/2$ as
the distance, and obtain an ordering from the optimal TSP path. In our experiments we use the
state-of-the-art Concorde TSP solver [Applegate et al.], which implements an exact algorithm
that has exponential time-complexity in the worst case but is very efficient in practice due to its
carefully designed pruning techniques.

Algorithm 3.5 Projected Gradient Method for (3.42)
Input: Data matrix $X = [x_1 \cdots x_n]$
Output: $\hat{Z}$
 1: Set $\alpha = 0.1$, $\epsilon = 10^{-6}$, $\sigma = 10^{-2}$
 2: Initialize $Z^{(1)}$, set $k = 1$
 3: repeat
 4:   Compute the gradient $\nabla^{(k)} := \nabla S(Z^{(k)})$, $\eta \leftarrow 1.0$
 5:   repeat
 6:     $\tilde{Z} = Z^{(k)} - \eta\nabla^{(k)}$, $D^{\mathrm{row}} = D^{\mathrm{col}} = [0]_{n \times n}$
 7:     repeat
 8:       $Z^{\mathrm{prev}} \leftarrow \tilde{Z}$
 9:       $Z'_{i\cdot} \leftarrow \ell_1$-Projection$((\tilde{Z} - D^{\mathrm{row}})_{i\cdot})$, $\forall i$
10:       $D^{\mathrm{row}} \leftarrow Z' - (\tilde{Z} - D^{\mathrm{row}})$
11:       $Z''_{\cdot j} \leftarrow \ell_1$-Projection$((Z' - D^{\mathrm{col}})_{\cdot j})$, $\forall j$
12:       $D^{\mathrm{col}} \leftarrow Z'' - (Z' - D^{\mathrm{col}})$
13:       $\tilde{Z} \leftarrow Z''$
14:     until $\|\tilde{Z} - Z^{\mathrm{prev}}\|_F \le \epsilon$
15:     $\eta \leftarrow \alpha\eta$
16:   until $S(\tilde{Z}) - S(Z^{(k)}) \le \sigma\,\mathrm{Tr}(\nabla^{(k)}(\tilde{Z} - Z^{(k)}))$
17:   $k \leftarrow k + 1$, $Z^{(k)} \leftarrow \tilde{Z}$
18: until $\|Z^{(k)} - Z^{(k-1)}\|_F \le \epsilon$
19: $\hat{Z} \leftarrow Z^{(k)}$

The  optimization  problem  (3.42)  is  essentially  convex  quadratic  programming  (QP)  under 
linear  and  bound  constraints.  However,  the  number  of  variables  is  quadratic  in  the  number  of 
data  points,  and  as  the  data  size  increases,  directly  applying  a  general-purpose  QP  or  nonlinear 
programming  solver  may  become  inefficient  or  even  infeasible.  We  thus  devise  a  simple  and 
efficient  projected  gradient  method  that  iteratively  updates  the  rows  and  the  columns  of  Z. 

Algorithm 3.6 $\ell_1$-Projection [Duchi et al., 2008]
Input: $v \in \mathbb{R}^n$
Output: $w := \arg\min_x \|x - v\|_2$ s.t. $x^\top\mathbf{1} = 1$, $x_i \ge 0\ \forall i$
1: Sort $v$ into $\mu$: $\mu_1 \ge \mu_2 \ge \ldots \ge \mu_n$.
2: Find $\rho = \max\{j \in \{1, 2, \ldots, n\} : \mu_j - \frac{1}{j}(\sum_{r=1}^{j}\mu_r - 1) > 0\}$
3: Define $\theta = \frac{1}{\rho}(\sum_{i=1}^{\rho}\mu_i - 1)$
4: Output $w$ s.t. $w_i = \max\{v_i - \theta, 0\}$

The key idea of a projected gradient method is to move the parameter vector along the
negative gradient direction, and project the updated vector back into the feasible region $\Omega$ whenever
it goes out. The cost of a projected gradient procedure is mainly determined by the projection
operation, so we need to compute efficiently the projection step:

$$Z^{t+1} \leftarrow \Pi_\Omega\big(Z^t - \eta\nabla S(Z^t)\big),$$  (3.43)

$$\Omega = \{Z : Z_{i\cdot}\mathbf{1} = 1,\ Z_{\cdot j}^\top\mathbf{1} = 1,\ Z_{ij} \ge 0,\ Z_{ii} = 0\},$$  (3.44)

where $Z_{i\cdot}$ and $Z_{\cdot j}$ denote a row and a column of $Z$ respectively, and $\Pi_\Omega(a) := \arg\min_b\{\|a - b\| \mid b \in \Omega\}$
is the Euclidean projection of a vector $a$ onto a region $\Omega$. The gradient of $S(Z)$ is
given by

$$\nabla S(Z) = 2(\tilde{Q} + \tilde{Q}^\top),$$

where

$$\tilde{Q}_{ij} := Q_{ii} - Q_{ij}, \qquad Q := X^\top X\big(\mathrm{diag}((Z + Z^\top)\mathbf{1}) - (Z + Z^\top)\big).$$
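
A gradient of this kind is easy to get wrong, so a finite-difference check is worthwhile. The sketch below computes $S(Z)$ and a gradient obtained by applying the chain rule to (3.40), with entries $2(Q_{ii} - Q_{ij} + Q_{jj} - Q_{ji})$ where $Q = X^\top X L$; the function names are hypothetical and the derivation is ours rather than quoted from the thesis.

```python
import numpy as np

def smoothness_objective(X, Z):
    """S(Z) = ||X L||_F^2 with L = diag((Z + Z^T) 1) - (Z + Z^T)."""
    Zbar = Z + Z.T
    L = np.diag(Zbar.sum(axis=1)) - Zbar
    return float(np.sum((X @ L) ** 2))

def smoothness_gradient(X, Z):
    """Gradient of S with respect to Z, derived by the chain rule."""
    Zbar = Z + Z.T
    L = np.diag(Zbar.sum(axis=1)) - Zbar
    Q = X.T @ X @ L
    Q_tilde = np.diag(Q)[:, None] - Q     # Q_tilde[i, j] = Q[i, i] - Q[i, j]
    return 2.0 * (Q_tilde + Q_tilde.T)

def numerical_gradient(X, Z, step=1e-6):
    """Central finite differences, used only to validate smoothness_gradient."""
    G = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        for j in range(Z.shape[1]):
            E = np.zeros_like(Z)
            E[i, j] = step
            G[i, j] = (smoothness_objective(X, Z + E)
                       - smoothness_objective(X, Z - E)) / (2.0 * step)
    return G
```

Because $S$ is quadratic in $Z$, the central difference agrees with the analytic gradient up to rounding error.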

Moreover, the feasible region (3.44) is the intersection of two closed convex sets $\Omega_1$ and $\Omega_2$:

$$\Omega_1 = \{Z : Z_{i\cdot}\mathbf{1} = 1,\ Z_{ij} \ge 0,\ Z_{ii} = 0,\ 1 \le i, j \le n\},$$

$$\Omega_2 = \{Z : Z_{\cdot j}^\top\mathbf{1} = 1,\ Z_{ij} \ge 0,\ Z_{ii} = 0,\ 1 \le i, j \le n\},$$

which correspond to the normalization constraints for rows and columns, respectively. Using
Dykstra's cyclic projection algorithm [Boyle and Dykstra, 1986], we perform the projection
operation (3.43) by alternately projecting onto $\Omega_1$ and $\Omega_2$. A very nice property of this procedure
is that projecting onto $\Omega_1$ or $\Omega_2$ alone can be further decomposed as doing row-wise (or
column-wise) projections, and a single-row or single-column projection can be computed very efficiently
by the $\ell_1$ projection technique Duchi et al. [2008] proposed, which we outline in Algorithm 3.6.
The required operations are simply sorting and thresholding². Algorithm 3.5 gives a summary
of the projected gradient method for the optimization problem (3.42). As in all gradient-based
methods, we conduct back-tracking line search for the step size $\eta$ to ensure convergence.
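
The projection step can be sketched as follows. `project_simplex` follows the sort-and-threshold recipe of Algorithm 3.6, and `project_feasible` alternates row and column projections with Dykstra-style correction terms as in the inner loop of Algorithm 3.5. The function names, the handling of the zero diagonal by masking, and the stopping rule (which additionally checks the row sums) are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : sum(w) = 1, w >= 0}."""
    mu = np.sort(v)[::-1]
    cumsum = np.cumsum(mu)
    j = np.arange(1, len(v) + 1)
    rho = np.nonzero(mu - (cumsum - 1.0) / j > 0)[0][-1]   # last index satisfying the test
    theta = (cumsum[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def project_rows(Z):
    """Project every row onto the simplex while keeping the diagonal at zero."""
    n = Z.shape[0]
    out = np.zeros_like(Z)
    for i in range(n):
        mask = np.arange(n) != i
        out[i, mask] = project_simplex(Z[i, mask])
    return out

def project_feasible(Z, eps=1e-6, max_iter=1000):
    """Alternate row and column projections with Dykstra correction terms."""
    D_row = np.zeros_like(Z)
    D_col = np.zeros_like(Z)
    for _ in range(max_iter):
        Z_prev = Z
        Z1 = project_rows(Z - D_row)
        D_row = Z1 - (Z - D_row)
        Z2 = project_rows((Z1 - D_col).T).T                # column-wise projection
        D_col = Z2 - (Z1 - D_col)
        Z = Z2
        rows_ok = np.abs(Z.sum(axis=1) - 1.0).max() <= eps
        if rows_ok and np.linalg.norm(Z - Z_prev) <= eps:
            break
    return Z
```

Each single-row projection costs one sort, so a full sweep over rows and columns is O(n² log n), which is what makes the projected gradient method practical compared with a general-purpose QP solver.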


3.4  Experiments  on  Synthetic  Data 

We  consider  five  dynamical  systems.  The  first  three  are  linear  systems,  while  the  last  two  are 
nonlinear  systems.  Our  experiments  here  focus  on  the  unordered  approximation  (Section  3.2.1), 
the partially-ordered approximation (Section 3.2.2), and their kernelized versions (Section 3.2.4),
referred to as UM, PM, KUM and KPM, respectively. For our experiments here and in Section
3.5, we implement the proposed algorithms in MATLAB and use the maximum directed spanning
tree solver available at http://edmonds-alg.sourceforge.net/, version 1.1.0.

3.4.1  Linear  Systems 

We  consider  the  following  three  linear  systems. 

²For ease of presentation, in Algorithm 3.6 we ignore the constraint $Z_{ii} = 0$, which can be easily enforced
by setting $Z_{ii} = 0$ and updating only the other $n - 1$ entries.

Figure 3.3: 2D sample points

Figure 3.4: 3D-1 sample points

Figure 3.5: 3D-2 sample points

• 2D, a two-dimensional diverging system:

$$A = \begin{bmatrix} 1.01 & 0 \\ 0 & 1.05 \end{bmatrix}, \quad x^0 = \begin{bmatrix} 50 \\ 50 \end{bmatrix}.$$

A sample of 200 points generated by Algorithm 3.1 is shown in Figure 3.3(a).

• 3D-1, a three-dimensional diverging system:

$$A = \begin{bmatrix} 1.1882 & 0.3732 & 0.1660 \\ -0.1971 & 0.8113 & -0.0107 \\ -0.1295 & -0.1886 & 0.9628 \end{bmatrix}, \quad x^0 = \begin{bmatrix} 10 \\ 10 \\ 10 \end{bmatrix}.$$

The eigenvalues of the transition matrix are 1.0143 and 0.9739 ± 0.24i, so the system
dynamics behaves like a diverging spiral in the 3-d space. A sample generated by
Algorithm 3.1 is shown in Figure 3.4(a), suggesting that temporally close points (those along
the spiral) can be spatially further away from each other than temporally remote points
(those cutting across the spiral).

• 3D-2, another three-dimensional diverging system:

$$A = \begin{bmatrix} 1.0686 & -0.0893 & 0.3098 \\ 0.4385 & 1.0091 & -0.2884 \\ -0.0730 & 0.0405 & 0.9625 \end{bmatrix}, \quad x^0 = \begin{bmatrix} 10 \\ 10 \\ 10 \end{bmatrix}.$$

The eigenvalues of the transition matrix are 1.0439 and 0.9982 ± 0.267i. A sample is
shown in Figure 3.5(a). Unlike in 3D-1, here temporal and spatial proximity are more
consistent with each other.
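
The spectral claims for the two three-dimensional systems can be checked directly from the transition matrices listed above. The snippet is a small sanity check only; the helper name `spectrum` is hypothetical.

```python
import numpy as np

A_3d1 = np.array([[ 1.1882,  0.3732,  0.1660],
                  [-0.1971,  0.8113, -0.0107],
                  [-0.1295, -0.1886,  0.9628]])
A_3d2 = np.array([[ 1.0686, -0.0893,  0.3098],
                  [ 0.4385,  1.0091, -0.2884],
                  [-0.0730,  0.0405,  0.9625]])

def spectrum(A):
    """Eigenvalues of A, sorted by decreasing real part."""
    w = np.linalg.eigvals(A)
    return w[np.argsort(-w.real)]

for name, A in [("3D-1", A_3d1), ("3D-2", A_3d2)]:
    w = spectrum(A)
    # A spectral radius above one means trajectories diverge; a complex pair means they spiral.
    print(name, np.round(w, 4), "spectral radius:", round(float(np.abs(w).max()), 4))
```

A real eigenvalue above one together with a complex-conjugate pair of modulus slightly above one is what produces the diverging spiral described for 3D-1.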

While the results presented here are all on diverging systems, we also experimented with
converging systems and got similar results. Using Algorithm 3.1, we generated data under a variety
of settings. For 2D, we generated 40 data sets, each containing 200 observations, with $\sigma = 0.2$.
For 3D-1 and 3D-2, we varied both the sample sizes and $\sigma^2$. For the small-sized experiments,
we generated 40 data sets, each containing 200 points, with $\sigma$ = 0.2, 0.4, 0.6 and 0.8. For the
large-sized experiments, we generated 20 data sets, each containing 2,000 points, with $\sigma$ in the
same range. We found that larger values of $\sigma$ overwhelmed the dynamics to such an extent that
no algorithm performed well. In all of the data sets we set $T_{\max} = 100$.

We  applied  UM  and  PM  to  these  data  sets,  maximizing  approximate  likelihood  functions 
without  any  prior  or  regularization  on  the  parameters  of  interest,  i.e.,  setting  the  hyper-parameters 
$\alpha$, $\beta$, and $\lambda$ in (3.8) and (3.21) to zero. For every data set we ran Algorithms 3.2 and 3.3 each with
M  random  initializations,  and  chose  the  model  with  the  largest  approximate  likelihood  as  the 
final  estimate.  The  entries  of  these  random  matrices  were  sampled  independently  and  uniformly 
from  [0, 1].  We  set  M  to  be  20  and  10  for  the  small-sized  and  the  large-sized  experiments, 
respectively. 

In addition to random initializations, we also explored the use of manifold learning
techniques for finding initial estimates. The rationale is that, for sample points generated by a linear
system, there should be a one-dimensional projection that indicates the correct order in time.
More specifically, we applied a manifold learning technique to our data, and mapped the sample
points to the most significant coordinate it found. Then, we sorted the data points according to
their one-dimensional projections and fitted a linear dynamical system by the usual least-square
estimation technique. The fitted system itself is already an estimate, and can be used to initialize
Algorithms 3.2 and 3.3. In our experiments, we found Maximum Variance Unfolding (MVU) by
Weinberger et al. [2004] to be the best manifold learning choice. Finally, to indicate the baseline
performance, we report results from randomly generated matrices. We used the same method
and generated the same number, M, of random matrices as we did to initialize PM and UM, and
selected the one with the highest score. We refer to this baseline as Rand.

We consider two performance metrics. The first compares the estimated system matrix $\hat{A}$
and the true $A$. To account for the ambiguities described in Section 3.1, we use the following
rate-adjusted matrix error:

$$\mathrm{ME}(\hat{A}, A) = \min_t \|\hat{A} - A^t\|_F,$$  (3.45)

where $A^t$ is $A$ raised to the power $t$. The minimum in (3.45) is hard to solve, so we search for $t$ in
$\{\pm 1, \pm 2, \ldots, \pm 10, \pm 1/2, \pm 1/3, \ldots, \pm 1/10\}$ and choose the one that minimizes (3.45). Such a
metric may overstate the quality of an estimate. We thus consider another criterion that compares
system matrices based on one-step displacement vectors

$$\mathrm{CS}(\hat{A}, A) = \left|\frac{1}{n}\sum_{i=1}^{n}\frac{(\hat{A}x_i - x_i)^\top(Ax_i - x_i)}{\|\hat{A}x_i - x_i\|\,\|Ax_i - x_i\|}\right|,$$  (3.46)

which we refer to as the cosine score. This criterion measures the similarity between the one-step
displacement vector $Ax_i - x_i$ of the true system and that of the estimated system, averaging over
all the sample points; a higher score (3.46) thus means a better estimate. Note that cosine is a
normalized measure of similarity, and therefore alleviates the issue of different system step sizes.
Also, since (3.46) takes the absolute value after averaging, going forward and backward in time
are considered equally good as long as they do so consistently.
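
Both metrics are straightforward to compute. The sketch below is one possible implementation under stated assumptions: `matrix_power_t` forms real powers through an eigendecomposition, which presumes that $A$ is diagonalizable with nonzero eigenvalues and uses the principal branch for fractional powers. The function names are hypothetical.

```python
import numpy as np

def matrix_power_t(A, t):
    """A raised to a real power t via eigendecomposition (assumes A is diagonalizable)."""
    w, V = np.linalg.eig(A)
    return (V * (w.astype(complex) ** t)) @ np.linalg.inv(V)

def matrix_error(A_hat, A):
    """Rate-adjusted matrix error (3.45) over the finite grid of powers t."""
    powers = [s * k for k in range(1, 11) for s in (1, -1)]
    powers += [s / k for k in range(2, 11) for s in (1, -1)]
    return min(np.linalg.norm(A_hat - matrix_power_t(A, t)) for t in powers)

def cosine_score(A_hat, A, X):
    """Cosine score (3.46); X holds the sample points as columns (p x n)."""
    d_hat = A_hat @ X - X
    d_true = A @ X - X
    cos = (d_hat * d_true).sum(axis=0) / (
        np.linalg.norm(d_hat, axis=0) * np.linalg.norm(d_true, axis=0))
    return abs(cos.mean())
```

For example, an estimate equal to the square of the true matrix has a rate-adjusted matrix error of essentially zero, and an estimate whose displacements all point backward in time still attains a cosine score of one.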

We  tested  the  following  methods:  MVU,  PM+MVU  (PM  initialized  by  MVU),  PM,  UM+MVU 
(UM  initialized  by  MVU),  UM,  and  Rand.  Results  on  2D  are  in  Figure  3.6.  For  this  baseline 
system, every approach performs quite well. Figure 3.3(c) shows displacement vectors $Ax_i - x_i$
estimated  by  UM  in  one  of  the  2D  samples,  which  are  quite  consistent  with  the  true  dynamics. 

Performance  metrics  on  the  more  complex  systems  3D-1  and  3D-2  are  in  Figures  3.7  and 
3.8, respectively. To qualitatively demonstrate the performance, we also plot the one-step
displacement vectors by the true and the learnt dynamic models in Figures 3.4 and 3.5 for 3D-1 and
3D-2. Since Rand is independent of data, we only report its results on the small samples. We did
not apply MVU to the large-sized samples with 2,000 data points, since its underlying semidefinite
program requires a huge amount of time and memory. Moreover, MVU alone usually gave
cosine scores as low as Rand, and as an initialization, it provided little or no improvement over
random initialization in most cases except UM in small-sized experiments for 3D-1. UM was
competitive with or better than PM in quite a few cases. However, on the small samples of 3D-1,
PM performed much better than UM. We also see that as the sample size grew, UM improved
more significantly than PM did. This suggests that imposing directionality constraints may
improve the estimation when samples are small, but it does so at the expense of introducing some
bias to the estimate.
Figure 3.6: Results on 2D, $\sigma = 0.2$. Panels: (a) matrix error; (b) cosine score.

Figure 3.7: Results on 3D-1. Panels: (a) matrix error; (b) cosine score. Methods shown: Rand, MVU,
PM+MVU, PM, UM+MVU and UM on the 200-point samples, and PM and UM on the 2000-point
samples; horizontal axis: noise level $\sigma$.

Figure 3.8: Results on 3D-2

Figure 3.9: A 2000-point sample with small noise $\sigma = 0.2$ on which UM failed (cosine score
0.0108). For better visualization, only 1/3 of the points are plotted.

Regarding  the  effects  of  different  noise  levels,  most  methods  became  worse  as  the  noise  level 
increased.  While  UM  was  the  most  robust  against  noise  in  several  cases,  it  performed  very  badly 
on the large samples of 3D-1 when $\sigma = 0.2$, but dramatically improved as noise increased.
We found that for the 20 large samples of 3D-1 generated with $\sigma = 0.2$, UM recovered the
true system matrix on nearly half of them, but totally failed on the rest. When it failed, the
estimated $\hat{A}$ was always nearly diagonal and exhibited dynamics as depicted in Figure 3.9. This
is a concrete example of the identifiability issue pointed out in Section 3.1.

3.4.2  Nonlinear  Systems 

We  consider  the  following  two  systems. 

•  3D-conv,  a  converging  three-dimensional  nonlinear  system  considered  by  Girard  and  Pappas 
[2005],  governed  by  the  following  differential  equations: 

dx(t)/dt  =  —(1  +  0.1  y(t)2)x(t), 

dy(t)/dt  =  —  (1  —  0.1x(t)2)y(t) /2  +  2z(t),  (3.47) 

dz(t)/dt  =  (1  —  0.1x(t))2y(t)  —  z(t)/2, 

where x(t), y(t), and z(t) are the three states at time t. The initial point is set to $[5\ 1\ 5]^\top$.

•  Lorenz  Attractor  [Lorenz,  1963]: 

dx(t)/dt  =  10  (y(t)—x(t)), 
dy(t)/dt  =  x(t)(28  -  z(t))  -  y(t), 
dz(t)/dt  =  x(t)y(t)  —  8z(t)/3. 

The initial point is set to [0 1 1.05]. Figure 3.11(a) shows a trajectory of 800 points evenly
sampled in the time interval [0, 20].

Figure 3.10: 3D-conv sample points


We generate data from these two systems as follows. For 3D-conv we use Algorithm 3.1 with line
5 replaced by a discrete-time approximation of the system equations (3.47), where the derivatives
remain constant during a time step of $\Delta t = 0.1$:

$$x^{(t+1)} := x^{(t)} + 0.1\,\frac{dx^{(t)}}{dt} + \epsilon^{(t)}.$$  (3.48)

The process noise $\epsilon^{(t)}$ follows a zero-mean Gaussian with standard deviations $\{0.1\Delta t, 0.5\Delta t\}$.
We generate 20 training data sets of 400 points with $T_{\max} = 100$. Figure 3.10(a) shows one of
the data sets. For Lorenz Attractor we did not use Algorithm 3.1 due to the chaotic nature of the
system. Instead, we add independent Gaussian noise to a system trajectory of 400 points evenly
sampled in the time interval [10, 20]³ with noise standard deviation $\sigma_{\mathrm{noise}} \in \{0.01\delta, 0.05\delta\}$,
where $\delta$ is the median of all the pairwise distances of the 800 points shown in Figure 3.11(a). For
each noise level we generate 20 training data sets without the true temporal order.
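
The discrete-time step (3.48) is a forward Euler update with additive Gaussian noise. The sketch below implements that update generically; `euler_step` and `lorenz_field` are hypothetical names. The Lorenz vector field is used here purely as an illustration because its equations are stated unambiguously above, whereas the thesis applies this step to 3D-conv and samples the Lorenz trajectory by other means.

```python
import numpy as np

def lorenz_field(state):
    """Right-hand side of the Lorenz equations with the parameters given above."""
    x, y, z = state
    return np.array([10.0 * (y - x),
                     x * (28.0 - z) - y,
                     x * y - 8.0 * z / 3.0])

def euler_step(field, state, dt=0.1, noise_std=0.0, rng=None):
    """One step x(t+1) = x(t) + dt * dx/dt + noise, holding the derivative constant."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, noise_std, size=state.shape) if noise_std > 0 else 0.0
    return state + dt * field(state) + noise

# One noise-free step from the Lorenz initial point, for illustration only.
print(euler_step(lorenz_field, np.array([0.0, 1.0, 1.05])))
```

A step size of 0.1 is coarse for a chaotic system, so this update should not be read as a way to reproduce the Lorenz trajectories used in the experiments.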

Our evaluation scheme here is slightly different from Section 3.4.1. Because our nonlinear
methods give nonparametric estimates and the true models are described by differential
equations, checking the model estimation error is difficult and not considered. We focus on evaluating
the prediction performance in terms of the cosine score. For 3D-conv we evenly sampled 200
points in the time interval [0, 10] as the testing sequence, shown in Figure 3.10(b) along with the
true dynamics represented by vectors of displacement between consecutive points. For Lorenz
Attractor we use the noise-free trajectory of 400 points as the testing sequence. Given a dynamic
model learnt from the training data, we predict for each data point $x^{(t)}$ in the testing sequence its
next observation $\hat{x}^{(t+1)}$ and compute the testing cosine score:

$$\frac{1}{T-1}\sum_{t=1}^{T-1}\frac{(\hat{x}^{(t+1)} - x^{(t)})^\top(x^{(t+1)} - x^{(t)})}{\|\hat{x}^{(t+1)} - x^{(t)}\|\,\|x^{(t+1)} - x^{(t)}\|},$$  (3.49)

where $T$ is the length of the testing sequence.

³This trajectory is the second half of the trajectory in Figure 3.11(a). It preserves the butterfly shape but leaves
out the highly dense spirals in the core of the left wing.

Figure 3.11: Lorenz Attractor sample trajectory. Panels: (a) system trajectory colored by time;
(b) estimated dynamics by KUM, cosine score 0.98.

Table 3.3: Cosine scores on noise-free data

            MVU      UM       PM       KUM      KPM
3D-conv     0.9954   0.9903   0.5570   0.9909   0.9225
Lorenz      0.1383   0.5644   0.2155   0.9884   0.334

We compare the two nonlinear versions of approximate likelihood methods, KUM and KPM,
against the Maximum Variance Unfolding [Weinberger et al., 2004] based approach described in
Section 3.4.1 combined with kernel regression. We also include results by the linear methods UM
and PM. Hyper-parameter settings are as follows. We use kernel regression with the Gaussian
kernel $\exp(-\|x - y\|^2/(2h))$. For KUM and KPM, we set the kernel bandwidth $h$ to $50\delta$, where
$\delta$ is the median of all pairwise distances in a training data set. For MVU we set $h = \delta$. Because
plain kernel regression tends to over-fit the training data, we regularize the model in two ways.
First we make use of the standard penalty in ridge regression, setting the regularization parameter
$\lambda$ in (3.36) to $10^{-4}$ and $10^{-3}$ for the two noise levels for each nonlinear system. Secondly, when
applying KUM and KPM, we use the low-rank approximation to the kernel matrix described in
Section 3.2.4 with rank $m = 5$, reducing the model complexity. For each data set we run KUM
and KPM with 50 random initializations of the regression coefficient matrix $B$ and $\sigma^2$. Entries
of $B$ are drawn independently from a zero-mean Gaussian with standard deviation 100, and $\sigma^2$
is drawn uniformly at random between 0 and 100 times the median of pairwise distances.

Results  are  in  Table  3.3,  Figures  3.12  and  3.13.  Table  3.3  reports  cosine  scores  obtained  by 
training  (without  temporal  order)  and  predicting  on  the  noise-free  trajectories  for  both  systems 
(Figures  3.10(a)  and  3.11(a)),  bold-facing  the  best  method  for  each  system.  For  3D-conv  all 
methods  perform  quite  well  except  PM.  Interestingly,  UM  performs  very  well  even  if  it  learns  a 
linear  model,  suggesting  3D-conv  may  be  well  approximated  by  a  linear  system  in  terms  of  one- 
step  predictions.  For  Lorenz  Attractor,  only  KUM  does  well  and  other  methods  are  significantly 
worse.  Figure  3.11(b)  shows  the  estimated  dynamics  by  KUM,  which  are  very  close  to  the  true 
dynamics.  However,  it  takes  hundreds  of  random  initializations  of  KUM  to  obtain  such  a  high 
cosine  score,  and  many  of  the  initializations  led  to  degenerate  or  undesirable  models. 
Figure 3.12: Results on 3D-conv. Panels: (a) $\sigma_{\mathrm{noise}} = 0.1\Delta t$; (b) $\sigma_{\mathrm{noise}} = 0.5\Delta t$.

Figure 3.13: Results on Lorenz Attractor. Panels: (a) $\sigma_{\mathrm{noise}} = 0.01\delta$; (b) $\sigma_{\mathrm{noise}} = 0.05\delta$.


Figure  3.12  gives  boxplots  of  cosine  scores  for  3D-conv  obtained  by  training  on  the  20  noisy 
data  sets  and  predicting  on  the  noise-free  testing  sequence.  The  proposed  methods  outperform 
the  MVU  based  approach  significantly,  but  performances  have  a  large  variance  across  different 
training  data  sets.  Linear  and  nonlinear  methods  achieve  comparable  scores,  showing  again  that 
3D-conv  may  be  well  approximated  by  a  linear  model  in  a  short  time  period. 

Figure  3.13  shows  cosine  scores  on  Lorenz  Attractor.  MVU  still  performs  poorly,  but  the 
proposed  methods  all  perform  worse  than  in  3D-conv,  especially  UM.  This  is  not  surprising 
because  Lorenz  Attractor  has  more  complex  dynamics  than  3D-conv.  We  also  see  that  although 
the  median  score  of  PM  across  all  the  20  training  data  sets  is  better  than  those  of  KUM  and 
KPM,  the  nonlinear  methods  are  able  to  reach  a  much  higher  score  than  PM.  This  indicates  that 
the  nonlinear  methods  are  on  the  one  hand  more  powerful  than  the  linear  methods,  but  on  the 
other  hand  more  vulnerable  to  overfitting  and  require  careful  initialization/regularization  based 
on  domain  or  prior  knowledge. 


3.5  Experiments  on  Real  Data 

We  consider  three  real  data  sets.  For  the  purpose  of  evaluation,  we  choose  data  whose  temporal 
orderings  are  known:  a  video  stream  of  a  swinging  pendulum  (Section  3.5),  gene  expression  time 
series of the yeast metabolic cycle (Section 3.5.2), and time series of HeLa cell images (Section
3.5.3).  While  the  first  one  is  for  evaluation  purposes  only,  the  other  two  data  sets,  as  briefly 
described  in  Chapter  1  and  explained  in  more  detail  later,  represent  application  areas  that  would 
truly  benefit  from  the  proposed  methods  of  learning  dynamic  models  from  non-sequence  data. 

In  real  data  we  do  not  have  true  dynamic  models  to  compare  with,  but  knowing  the  temporal 
order  allows  us  to  obtain  reference  models  by  applying  existing  dynamic  model  learning  methods 
with  the  available  temporal  order.  Our  main  evaluation  scheme  is  then  to  compare  the  prediction 
performances  of  the  proposed  methods  against  standard  methods  of  learning  from  sequence  data. 
We  consider  two  performance  metrics.  One  is  the  cosine  score  (3.49).  However,  since  the  real 
data  are  higher-dimensional,  when  interpreting  cosine  scores  we  need  to  account  for  the  effect  of 
high  dimensions.  To  see  that  effect,  we  consider  the  probability  that  random  prediction  achieves 
a  cosine  score  of  s  or  greater.  Let  the  random  variable  S  denote  the  cosine  between  a  vector 
drawn  uniformly  at  random  from  the  unit  p-sphere  and  an  arbitrary  fixed  p-dimensional  unit 
vector. Basic geometry shows that the probability of $|S| \ge s$ for some $s > 0$ is equivalent to two
times the ratio of the surface area of a cap with height $1 - s$ on a unit p-sphere to the unit p-sphere
surface area, which has the following closed-form [Li, 2011]:

$$\mathrm{Prob}(|S| \ge s) = I_{1-s^2}\Big(\frac{p-1}{2}, \frac{1}{2}\Big),$$  (3.50)

where $I_x(a, b)$ is the regularized incomplete beta function. Now consider the cosine score of $n$
independent random predictions $|\bar{S}| := |\sum_{i=1}^{n} S_i / n|$, where the $S_i$'s are independent copies of $S$. By
Bennett's inequality [1962] and symmetry of $S$, we have

$$\mathrm{Prob}(|\bar{S}| \ge s) = 2\,\mathrm{Prob}(\bar{S} \ge s) \le 2\exp\big(-n\,\mathrm{Var}(S)\,h(s/\mathrm{Var}(S))\big),$$  (3.51)

where $h(x) := (1 + x)\log(1 + x) - x$. To derive $\mathrm{Var}(S)$, we first obtain the p.d.f. of $S$:

$$f_S(s) = \frac{d\,\mathrm{Prob}(S \le s)}{ds} = -\frac{1}{2}\,\frac{d\,I_{1-s^2}\big(\frac{p-1}{2}, \frac{1}{2}\big)}{ds} = \frac{(1 - s^2)^{(p-3)/2}}{B\big(\frac{p-1}{2}, \frac{1}{2}\big)},$$  (3.52)

where $B(x, y)$ is the beta function. Then we have

$$\mathrm{Var}(S) = \int_{-1}^{1} s^2 f_S(s)\,ds = \frac{B\big(\frac{p-1}{2}, \frac{3}{2}\big)}{B\big(\frac{p-1}{2}, \frac{1}{2}\big)} = p^{-1},$$  (3.53)

leading to the following upper bound:

$$\mathrm{Prob}(|\bar{S}| \ge s) \le 2\exp\big(-n\,h(sp)/p\big) = 2\left(\frac{\exp(s)}{(1 + sp)^{1/p + s}}\right)^{n},$$  (3.54)

which decreases in the order of $p^{-sn}$ for fixed $s$. Figure 3.14 shows this upper bound for four
values  of  s  and  n  =  10  as  a  function  of  p.  We  can  see  that  when  the  dimension  p  is  large,  even  if 
the  number  n  of  predictions  is  small  (as  in  Section  3.5.2)  it  is  still  difficult  for  random  prediction 
Figure 3.14: Chance probability (3.54) for various cosine scores against the dimension p


to  achieve  some  modest  cosine  score,  say  0.3.  In  later  sections  we  will  use  the  upper  bound 
(3.54)  to  provide  some  sense  of  significance  of  the  cosine  scores  obtained  in  our  experiments. 
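
The bound (3.54) is cheap to evaluate, which makes it convenient as a rough significance check. The sketch below computes it in both of its stated forms; the function names are hypothetical.

```python
import math

def chance_bound(s, n, p):
    """Upper bound 2 * exp(-n * h(s * p) / p), with h(x) = (1 + x) * log(1 + x) - x."""
    x = s * p
    h = (1.0 + x) * math.log1p(x) - x
    return 2.0 * math.exp(-n * h / p)

def chance_bound_closed_form(s, n, p):
    """Equivalent form 2 * (exp(s) / (1 + s * p) ** (1 / p + s)) ** n."""
    return 2.0 * (math.exp(s) / (1.0 + s * p) ** (1.0 / p + s)) ** n

# Bound for a cosine score of 0.3 with n = 10 predictions at a few dimensions p.
for p in (10, 100, 1000):
    print(p, chance_bound(0.3, 10, p))
```

Since this is an upper bound from a concentration inequality, values above one carry no information; the bound only becomes useful once the dimension is moderately large.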

Our second performance metric is normalized error:

$$\frac{1}{T-1} \sum_{t=1}^{T-1} \frac{\|\hat{x}^{(t+1)} - x^{(t+1)}\|}{\|x^{(t+1)} - x^{(t)}\|}, \tag{3.55}$$

which measures how close the predictions are to the true next state vectors. A smaller normalized
error means a better prediction, and predicting with the current observation ($\hat{x}^{(t+1)} = x^{(t)}$) gives
a normalized error of one.

In all three data sets we apply UM and PM, and in Sections 3.5.1 and 3.5.2 we also use the
tree-based EM (TEM) method. For the experiments in Section 3.5.3 we use random initialization,
while in Sections 3.5.1 and 3.5.2 we apply the temporal clustering heuristics in Section 3.3 for
initialization with the following detailed settings.

We use the K-means method to cluster the data points and compare the following four methods
for ordering the cluster centers:

1. MVU: Project the cluster centers to the one-dimensional space found by Maximum Variance
Unfolding, and then sort the cluster centers according to the projections.

2. ℓ1+TSP: Solve a TSP with the 1-norm pairwise distances between the cluster centers.

3. ℓ2+TSP: Solve a TSP with the 2-norm pairwise distances between the cluster centers.

4. TSM+TSP: The two-step heuristics outlined in Section 3.3.

Then, we learn a linear model (3.1) from the ordered cluster centers, and initialize the proposed
methods with the learnt model. As mentioned in Section 3.3, methods based on TSP do not
decide the overall direction of time. Here we learn dynamic models using both directions, and
report the one that leads to a better prediction performance. To solve a TSP, we use the
state-of-the-art Concorde TSP solver [Applegate et al.].

For UM and TEM, we not only initialize the estimation procedures with clustering, but also
consider restricting the approximate likelihood functions by the ordering of the cluster centers:
when summing over the latent variables in (3.8) and (3.25), we only include those consistent with
the ordering of the clusters. We refer to the restricted versions of UM and TEM respectively as
“UM rest” and “TEM rest.”

Figure 3.15: A frame of the swinging pendulum video stream.
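The normalized error (3.55) introduced earlier in this section can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the experiments; the array layout (one state vector per row) and the function name are assumptions.

```python
import numpy as np

def normalized_error(x_true, x_pred):
    """Normalized error (3.55).

    x_true: array of shape (T, p) holding the true states x^(1), ..., x^(T).
    x_pred: array of shape (T-1, p) holding predictions of x^(2), ..., x^(T).
    """
    x_true = np.asarray(x_true, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    # Numerator: distance between each prediction and the true next state.
    num = np.linalg.norm(x_pred - x_true[1:], axis=1)
    # Denominator: distance the true state actually moved in that step.
    den = np.linalg.norm(x_true[1:] - x_true[:-1], axis=1)
    return float(np.mean(num / den))
```

Passing the current observations as predictions, `normalized_error(x, x[:-1])`, returns one, which matches the reference value stated for (3.55). The sketch assumes consecutive true states are never identical, since the denominator would then be zero.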

We extend the proposed methods to allow each state variable to have a different noise variance.
The update rules can be easily derived from those in Section 3.2 and have a similar form.
We choose the regularization parameter λ by leave-one-out cross validation on the ordered cluster
centers, but set α and β by manual selection. Our choice of α and β is mainly to avoid numerical
issues caused by small values of the estimated noise variances during the EM iterations. We find
the following choice to be effective: α = 1 and β ≈ n, which correspond to a prior of noise variance
whose mean is around n, and in our experiments leads to a posterior mean around 2.

3.5.1  Video  of  Swinging  Pendulum 

Our first real data set is a video analyzed by Siddiqi et al. [2010]. The video consists of 500 frames
of 240-by-240 colored images of a swinging pendulum. An image is shown in Figure 3.15.
The underlying dynamics is highly periodic and stable as the pendulum completes about 22 full
swings⁴. We center the pixel values to be zero across the 500 frames, and then apply Singular
Value Decomposition (SVD) to reduce the dimension from 240 × 240 × 3 = 172,800 to 20
by projecting the data onto the subspace corresponding to the 20 largest singular values. Such
a subspace preserves about 72 percent of the total energy. We further normalize each of the
20 temporal sequences to be zero-mean and unit-variance. Then we use the first 400 points as
training data and the last 100 points as testing data.
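The preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions, not the original pipeline: frames are assumed to be flattened into the rows of a matrix, "energy" is taken to mean the share of squared singular values, the normalization is assumed to be computed over the whole sequence, and the function name is hypothetical.

```python
import numpy as np

def reduce_and_split(frames, k=20, n_train=400):
    """Center, project onto the top-k singular subspace, normalize, and split.

    frames: array of shape (num_frames, num_pixels), one flattened frame per row.
    Returns (train, test, energy), where energy is the fraction of the total
    squared singular values kept by the top-k subspace.
    """
    frames = np.asarray(frames, dtype=float)
    # Center every pixel so that its mean across frames is zero.
    centered = frames - frames.mean(axis=0)
    # Thin SVD; rows of vt are the right singular vectors.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of each frame in the top-k singular subspace.
    reduced = centered @ vt[:k].T
    energy = float(np.sum(s[:k] ** 2) / np.sum(s ** 2))
    # Normalize each of the k temporal sequences to zero mean, unit variance.
    reduced = (reduced - reduced.mean(axis=0)) / reduced.std(axis=0)
    return reduced[:n_train], reduced[n_train:], energy
```

For the pendulum video this would be called on a 500-by-172,800 matrix with the defaults `k=20` and `n_train=400`.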

In  the  initialization  step  we  combine  the  K-means  method  with  the  AIC  criterion  to  determine 
the number of clusters. For each possible number of clusters in our search range, we run the
K-means method with 30 random restarts and choose the best clustering to initialize the dynamic
model.  We  repeatedly  train  30  linear  dynamic  models,  all  of  which  are  initialized  by  K-means 
combined  with  AIC.  In  most  of  the  30  runs  the  number  of  clusters  determined  by  K-means  and 
AIC  is  31,  which  is  about  the  number  of  time  steps  one  full  swing  takes.  We  then  evaluate  these 
learnt  models  by  their  prediction  performances  on  the  test  data. 

4 A  full  swing  means  the  pendulum  ended  where  it  started. 
Figure 3.16: Cosine scores on the pendulum data by the linear model. Larger is better. The blue
dashed line is by a dynamic model learnt with the known temporal order.

(a) Ordered by MVU (b) Ordered by ℓ1+TSP (c) Ordered by ℓ2+TSP (d) Ordered by TSM+TSP

Figure 3.17: Normalized errors on the pendulum data by the linear model. Smaller is better. The
blue dashed line is by a model learnt with the known order.
We present the box plots of the testing cosine scores and normalized errors in Figures 3.16
and 3.17. The leftmost column in each plot, the “kmeans” column, gives the performance of the
initial model found by K-means and some ordering method. In each box plot we also indicate
the performance of the reference model learnt with the known temporal ordering. There are two
main observations:

• Comparing the four ordering methods, we find that MVU is worse than ℓ1+TSP and
ℓ2+TSP, which are in turn worse than TSM+TSP. Moreover, TSM+TSP does almost as
well as the model learnt with the known temporal ordering. This suggests that orderings
solely based on pairwise distances, such as those by MVU, ℓ1+TSP, and ℓ2+TSP, may be
more sensitive to distances between cluster centers, which are not always equally separated
in space. On the contrary, TSM+TSP is more robust against irregular distances, suggesting
that the pairwise similarity learnt through solving the convex program (3.42) better
captures the dynamic nature of the data.

• The initial models learnt from ordered cluster centers already perform quite well, and the
proposed methods result in only marginal improvements. Moreover, without the restriction
imposed by cluster orderings UM even performs worse than the initial model. This
suggests that our approximation to the likelihood function may introduce too many
undesirable local maxima.

3.5.2  Gene  Expression  Time  Series  of  Yeast  Metabolic  Cycle 

To study gene expression dynamics of yeasts during the metabolic cycle, Tu et al. [2005]
collected expression profiles of about 6,000 yeast genes along three consecutive metabolic cycles,
each containing 12 samples. Due to the destructive nature of the measurement technique, gene
expression profiles were measured on different yeast cells, and therefore synchronization of yeast
cells in the metabolic cycle is necessary for obtaining reliable gene expression time series data.
To address this issue, Tu et al. [2005] developed a continuous culture system that provides a
stable environment for yeast cells to grow, and chose a particular strain of yeasts that exhibit
“unusually robust periodic behavior,” i.e., cells of that strain of yeasts are in a sense self-synchronizing.
However, Tu et al. [2005] noted that the periodic gene expression observed in their experiment is
more robust than those in certain other species (cf. the Discussion in [Tu et al., 2005]), suggesting
that in general it may be quite difficult to obtain reliable time series gene expression
measurements. In those cases, our proposed methods of learning dynamic models from non-sequence
data may be very useful.

We focus on a subset of 3,552 genes found by Tu et al. [2005] to exhibit strong periodic
behavior during the metabolic cycle. We normalize each gene expression time series to be
zero-mean and unit-variance, and use the first two cycles (24 points) as training data and the last
cycle (12 points) as testing data. Here the number of state variables (genes) is much higher than
the  sample  size,  and  thus  learning  dynamic  models  is  much  more  difficult  than  in  the  previous 
experiment. 

Since the number of sample points is disproportionately smaller than the dimension, in the
initialization step we specify the number of clusters in the K-means method to be 12, the number
of time steps in one cycle. This means each cluster will contain only a few data points. We
repeatedly train 20 linear models, and in each of the 20 runs we randomly restart the K-means
method 30 times and choose the best clustering to initialize the model.

Figure 3.18: Gene expression profiles in three major gene groups: MRPL10, POX1, and
RPL17B. Top row: original gene expression. Bottom row: gene expression from estimated
cluster centers ordered by TSM+TSP.

To  evaluate  the  proposed  methods,  we  first  qualitatively  examine  our  initial  step  of  temporal 
clustering and ordering. Among the 3,552 genes, MRPL10, POX1 and RPL17B were found to
be strongly periodic and yet exhibit different dynamics. Treating these three genes as fixed
seeds  in  clustering  analysis,  Tu  et  al.  [2005]  identified  three  major  clusters  of  genes.  From  each 
cluster  we  pick  the  24  most  representative  genes  and  plot  their  average  expression  profiles  over 
the  first  two  cycles  in  the  top  row  of  Figure  3.18.  In  the  bottom  row  of  the  same  figure  we  plot 
the  expression  profiles  of  the  same  genes  from  the  estimated  cluster  centers  in  the  order  found  by 
TSM+TSP.  Comparing  the  two  rows  shows  our  initial  step  of  temporal  clustering  and  ordering 
effectively  recovers  the  major  trends  of  gene  dynamics. 

We then evaluate the proposed methods quantitatively. Figures 3.19 and 3.20 present box
plots of cosine scores and normalized errors. The cosine scores are between 0.6 and 0.7, which
by themselves do not seem impressive, but because of the high dimension, the probability for
random predictions to achieve such scores, according to (3.54), is less than $10^{-19}$ even though the
testing sequence is short. Moreover, the improvements due to the proposed EM-based methods
over the initial model are more significant here than in Section 3.5.1, though the gap between
the proposed methods and the sequential learning method is bigger. Most of the performance
measures here are rather stable across different runs, possibly because on such a small sample
most initializations turn out to be similar. The only exception is TEM, which occasionally results
in extremely poor performance. This is due to numerical difficulties encountered in its E-step;
the main computation there is inverting a matrix of exponentiated negative distances, which
is numerically unstable for high dimensional data points. Regarding the different ordering
methods, unlike in Section 3.5.1, TSM+TSP does not outperform the other ordering methods; all
four methods perform equally well. Again, this may be attributed to the training points being too
few for different ordering methods to behave differently.

Figure 3.19: Cosine scores on the yeast time series. Larger is better. The blue dashed line is by
a dynamic model learnt using the known temporal order.

Figure 3.20: Normalized errors on the yeast time series. Smaller is better. The blue dashed line
is by a dynamic model learnt using the known temporal order.

3.5.3  Cell  Image  Time  Series 

We  apply  the  proposed  method  to  a  time  series  data  set  of  HeLa  cell  images  originally  collected 
by  Zhou  et  al.  [2009],  and  subsequently  analyzed  by  Buck  et  al.  [2009],  who  were  interested  in 
the dependence of protein subcellular localization on the cell cycle. Instead of relying on
time-series cell images as in most existing studies, they aimed to utilize static, asynchronous snapshots
taken  from  multiple  cells  at  various  phases  of  the  cell  cycle  because  such  images  are  easier  to 
obtain  on  a  large  scale  than  time-series  images.  To  do  so,  they  proposed  to  find  a  one-dimensional 
surrogate  of  cell  cycle  time  from  static  cell  image  features  by  manifold  learning  techniques,  and 
verified  on  real  data  that  such  a  surrogate  is  well  correlated  with  the  cell  cycle.  However,  it  is 
not  clear  how  to  use  or  augment  their  approach  for  predictive  analysis,  which  can  be  important 
in  understanding  cell  dynamics.  In  contrast,  our  work  bypasses  the  issue  of  estimating  the  cell 
cycle  time  and  focuses  directly  on  learning  dynamic  models. 

The  data  set  consists  of  100  time  frames,  and  each  frame  contains  from  tens  to  a  hundred  or  so 
cell  regions.  Details  regarding  cell  segmentation  and  tracking  can  be  found  in  [Zhou  et  al.,  2009]. 
Each segmented cell region is represented by a 49-dimensional feature vector as in [Buck et al.,
2009].  During  the  100  time  frames,  some  cells  went  through  more  than  one  division  while  others 
never  divided.  Buck  et  al.  [2009]  identified  a  total  of  34  sequences  of  cells  that  completed  at 
least  one  full  division-to-division  cell  cycle  spanning  at  least  30  time  frames,  and  conducted 
their  analysis  on  these  sequences.  We  instead  treat  these  34  sequences,  which  contain  a  total  of 
1,740  data  points,  as  testing  data,  and  run  UM  and  PM  on  the  other  short  or  incomplete  sequences 
as  if  they  were  non-sequence  samples.  Out  of  the  7,692  feature  points  in  these  partial  cell  cycle 
sequences,  1,267  appear  in  only  one  time  frame,  fitting  exactly  our  non-sequence  scenario.  We 
normalize  the  entire  data  set  so  that  each  feature  has  mean  zero  and  standard  deviation  1.  To 
obtain a performance reference from models learnt with sequence information, we apply
least-squares ridge regression to partial sequences of length at least 6, a total of 6,099 feature points.
The  regularization  parameter  for  ridge  regression  was  chosen  by  training  on  the  first  half  of  each 
training  sequence  and  validating  on  the  second  half. 

In applying UM and PM we made several changes. We allow each feature to have a different
noise variance, but instead of optimizing over its value, we simply set the noise variance of each
feature to be the median of pairwise distances between training data points along that feature
dimension. Moreover, we add an extra regularization term proportional to $\|A - I\|_F^2$ to our approximate
likelihood functions and set λ for the penalty term on A to be 1. We initialize UM and PM
with 20 different models, one being the identity matrix and the other 19 having entries drawn
independently from a standard normal distribution. We choose the final estimate based on the
regularized approximate likelihood functions for UM and PM, respectively.

Figure 3.21: Testing performances on cell image time series

We compare with a baseline that uses manifold learning. Following Buck et al. [2009],
we use Isomap [Tenenbaum et al., 2000] to map all the 7,692 training data points to a
two-dimensional space, sort the data points according to their mappings along the first coordinate,
and then apply ridge regression with both the estimated ordering and the reverse ordering. At
test time, we apply both learnt models and report the better performance.
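The ordering-then-regression part of this baseline can be sketched as below. The sketch assumes the one-dimensional ordering coordinate (for example, the first Isomap coordinate) has already been computed elsewhere; it uses the row-vector convention in which a point x is followed by xA, and all function and argument names are hypothetical.

```python
import numpy as np

def ridge_transition(X, Y, lam):
    """Minimize ||Y - X A||_F^2 + lam * ||A||_F^2 over A (rows are states)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def fit_both_directions(points, order_coord, lam=1.0):
    """Sort points by a 1-D surrogate of time and fit one model per direction.

    points: array (n, p); order_coord: array (n,) used only for sorting.
    Returns (A_forward, A_backward); the caller keeps whichever predicts better.
    """
    points = np.asarray(points, dtype=float)
    forward = points[np.argsort(order_coord)]
    backward = forward[::-1]
    a_forward = ridge_transition(forward[:-1], forward[1:], lam)
    a_backward = ridge_transition(backward[:-1], backward[1:], lam)
    return a_forward, a_backward
```

Fitting both directions reflects the fact that a one-dimensional embedding cannot tell which end of the ordering corresponds to earlier time.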

Figures 3.21(a) and 3.21(b) are box plots of the two performance measures over the 34 testing
sequences. Again, although the cosine scores do not seem impressive, the chance probability
for achieving such scores, given by (3.50), is lower than $10^{-6}$. As expected, Ridge performs
the best, but the proposed UM and PM are quite close and even have a smaller variance in
normalized error. The baseline that uses Isomap is competitive with UM in terms of cosine score,
but shows larger variability across different testing sequences and has much larger normalized
errors. Unlike in the last two experiments, PM performs noticeably worse than UM here. We
suspect that PM's strong approximation bias of enforcing the spanning-tree constraint hinders
effective use of this quite large data set, but do not have solid empirical evidence. More
generally, it requires further research to understand when UM or PM will be a better choice in terms
of quality of the learnt model, but UM certainly scales better with the data size: UM's E step of
normalization is embarrassingly parallelizable, whereas PM's maximum directed spanning tree
procedure is harder to parallelize. In this experiment, our MATLAB implementation of UM,
which performs efficient matrix normalization via parallelization, is more than 10 times faster
than PM, which spends most of the running time in the maximum directed spanning tree solver
(http://edmonds-alg.sourceforge.net/, version 1.1.0).

Another way of evaluating the proposed approximate likelihood functions is to check whether
a better training likelihood leads to a better testing performance. Figure 3.22 gives scatter plots
of the two testing performance measures against regularized UM and PM training negative
likelihood functions over the 20 initializations. Each curve represents results on a testing sequence
and is sorted by the training likelihood. We can see that for both UM and PM, the training
approximate likelihood has a rather small numerical range, and there is no significant correlation
between the testing performance and the training likelihood. Moreover, on most testing
sequences the UM performances are similar across the 20 initial models, while the PM normalized
error has a larger variation. These results illustrate limitations of our proposed methods when
applied to real data, and suggest that a different strategy of choosing the final estimate is needed
in order to achieve better testing performance, which we leave as an open issue for future work.
(c) PM: normalized error (d) PM: cosine score

Figure 3.22: Scatter plot of testing performance against training approximate likelihood


3.6  Conclusion 

In this chapter we consider learning fully observable dynamic models from data drawn from
independent trajectories at unknown times. Acknowledging several identifiability issues, we
propose learning methods based on maximizing various approximate likelihood functions via
EM-type algorithms, together with novel initialization methods. Experiments on synthetic and
real data demonstrate that the proposed methods can learn reasonably good models from
non-sequence data, though their success requires some hyper-parameter tuning, and more critically,
good initialization. We thus in later chapters consider settings requiring extra information or
assumptions, but leading to simpler learning problems.
Chapter 4

Learning  Vector  Autoregressive  Models 
from  Sequence  and  Non-sequence  Data 


As concluded by the previous chapter, the assumption of multi-trajectory, independent samples
leads to several identifiability issues that compromise effective learning. We thus consider
making stronger assumptions in several ways, starting in this chapter with the availability of a small
amount of sequence data in addition to the supposedly larger non-sequence data. Our goal is to
combine these two types of data to improve dynamic model learning. As in the previous chapter,
we consider p-dimensional vector auto-regressive models, but treat the state variables as a row
vector instead of a column vector:

$$x^{(t+1)} = x^{(t)} A + \epsilon^{(t+1)}, \tag{4.1}$$

where $\epsilon^{(t)}$ is again an independent Gaussian noise process with a time-invariant variance $\sigma^2 I$.
In addition, we assume that the process (4.1) is stable, i.e., the eigenvalues of A have modulus
less than one. As a result, the process (4.1) has a stationary distribution, whose covariance Q is
determined by the following discrete-time Lyapunov equation:

$$A^\top Q A + \sigma^2 I = Q. \tag{4.2}$$

Linear quadratic Lyapunov theory (see e.g., [Antsaklis and Michel, 2005]) gives that Q is uniquely
determined if and only if $\lambda_i(A)\lambda_j(A) \ne 1$ for $1 \le i, j \le p$, where $\lambda_i(A)$ is the i-th eigenvalue of
A. If the noise process $\epsilon^{(t)}$ follows a normal distribution, the stationary distribution also follows a
normal distribution, with covariance Q determined as above. Since our goal is to estimate A, a
more relevant perspective is viewing (4.2) as a system of constraints on A. What motivates the
proposed approach in this chapter is that the estimation of Q requires only samples drawn from
the stationary distribution rather than sequence data. However, even if we have the true Q and
$\sigma^2$, we still cannot uniquely determine A because (4.2) is an under-determined system¹ of A. We
thus rely on the few sequence samples to resolve the ambiguity.
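Both directions of this argument can be checked numerically. The sketch below first solves the Lyapunov equation (4.2) for Q given A and $\sigma^2$ by vectorization, and then builds a second matrix that satisfies (4.2) with the same Q, which illustrates why Q and $\sigma^2$ alone do not pin down A. The construction $Q^{-1/2} U Q^{1/2} A$ with an orthogonal U is one convenient way to produce such a matrix and is our illustration, not a construction taken from the text; the transition matrix is the one used in the bivariate example later in this chapter, and the function names are hypothetical.

```python
import numpy as np

def stationary_covariance(A, sigma2):
    """Solve A^T Q A + sigma2 * I = Q for Q via vectorization."""
    p = A.shape[0]
    # For row-major flattening, vec(A^T Q A) = kron(A^T, A^T) @ vec(Q).
    K = np.kron(A.T, A.T)
    q = np.linalg.solve(np.eye(p * p) - K, sigma2 * np.eye(p).ravel())
    Q = q.reshape(p, p)
    return 0.5 * (Q + Q.T)  # remove round-off asymmetry

def lyapunov_residual(A, Q, sigma2):
    """Left-hand side minus right-hand side of (4.2)."""
    return A.T @ Q @ A + sigma2 * np.eye(A.shape[0]) - Q

def alternative_solution(A, Q, U):
    """Return Q^{-1/2} U Q^{1/2} A, which leaves A^T Q A unchanged if U is orthogonal."""
    w, V = np.linalg.eigh(Q)
    root = (V * np.sqrt(w)) @ V.T  # symmetric square root of Q
    return np.linalg.solve(root, U @ root @ A)

A = np.array([[-0.4280, 0.5723], [-1.0428, -0.7144]])
Q = stationary_covariance(A, 1.0)
U = np.array([[0.0, 1.0], [-1.0, 0.0]])  # a 90-degree rotation
A_alt = alternative_solution(A, Q, U)
print(np.abs(lyapunov_residual(A, Q, 1.0)).max())
print(np.abs(lyapunov_residual(A_alt, Q, 1.0)).max())
```

Both residuals are zero up to round-off even though `A_alt` differs from `A`, so some sequence information is needed to choose among the candidates.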

Let $\{x^{(t)}\}_{t=1}^{T}$ be a sequence of observations generated by the process (4.1). The standard
least-square estimator for the transition matrix A is the solution to the following minimization
problem:

$$\min_A \|Y - XA\|_F^2, \tag{4.3}$$

where $Y^\top := [(x^{(2)})^\top\ (x^{(3)})^\top \cdots (x^{(T)})^\top]$, $X^\top := [(x^{(1)})^\top\ (x^{(2)})^\top \cdots (x^{(T-1)})^\top]$, and $\|\cdot\|_F$
denotes the matrix Frobenius norm. When p > T, which is often the case in modern time series
modeling tasks, the least square problem (4.3) has multiple solutions all achieving zero squared
error, and the resulting estimator overfits the data. A common remedy is adding a penalty term
on A to (4.3) and minimizing the resulting regularized sum of squared errors. Usual penalty
terms include the ridge penalty $\|A\|_F^2$ and the sparse penalty $\|A\|_1 := \sum_{i,j} |A_{ij}|$.

¹ If we further require A to be symmetric, (4.2) would be a simplified continuous-time algebraic Riccati
equation, which has a unique solution under some conditions (cf. [Antsaklis and Michel, 2005]).

Now suppose we also have a set of non-sequence observations $\{z_i\}_{i=1}^{n}$ drawn independently
from the stationary distribution of (4.1). Note that we use superscripts for time indices and
subscripts for data indices. As described in Chapter 1, the size n of the non-sequence sample can
usually be much larger than the size T of the sequence data. To incorporate the non-sequence
observations into the estimation procedure, we first obtain a covariance estimate $\hat{Q}$ of the
stationary distribution from the non-sequence sample, and then turn the Lyapunov equation (4.2) into a
regularization term on A. More precisely, in addition to the usual ridge or sparse penalty terms,
we also consider the following regularization:

$$\|A^\top \hat{Q} A + \sigma^2 I - \hat{Q}\|_F^2, \tag{4.4}$$

which we refer to as the Lyapunov penalty. To compare (4.4) with the ridge penalty and the sparse
penalty, we consider (4.3) as a multiple-response regression problem and view the i-th column
of A as the regression coefficient vector for the i-th output dimension. From this viewpoint, we
immediately see that both the ridge and the sparse penalizations treat the p regression problems as
unrelated. On the contrary, the Lyapunov penalty incorporates relations between pairs of columns
of A by using a covariance estimate $\hat{Q}$. In other words, although the non-sequence sample does
not provide direct information about the individual regression problems, it does reveal how the
regression problems are related to one another. To illustrate how the Lyapunov penalty may help
to improve learning, we give an example in Figure 4.1. The true transition matrix is

$$A = \begin{bmatrix} -0.4280 & 0.5723 \\ -1.0428 & -0.7144 \end{bmatrix}$$

and $\epsilon^{(t)} \sim \mathcal{N}(0, I)$. We generate a sequence of 4 points, draw a non-sequence sample of 20 points
independently from the stationary distribution and obtain the sample covariance $\hat{Q}$. We fix the
second column of A but vary the first, and plot in Figure 4.1(a) the resulting level sets of the sum
of squared errors on the sequence (SSE) and the ridge penalty (Ridge), and in Figure 4.1(b) the
level sets of the Lyapunov penalty (Lyap). We also give coordinates of the true $[A_{11}\ A_{21}]^\top$, the
minima of SSE, Ridge, and Lyap, respectively. To see the behavior of the ridge regression, we
trace out a path of the ridge regression solution by varying the penalization parameter, as
indicated by the red-to-black curve in Figure 4.1(a). This path is quite far from the true model, due
to insufficient sequence data. For the Lyapunov penalty, we observe that it has two local minima,
one of which is very close to the true model, while the other, also the global minimum, is very
far. Thus, neither ridge regression nor the Lyapunov penalty can be used on its own to estimate
the true model well. But as shown in Figure 4.1(c), the combined objective, SSE+Ridge+½Lyap,
has its global minimum very close to the true model. This demonstrates how the ridge regression
and the Lyapunov penalty may complement each other: the former by itself gives an inaccurate
estimation of the true model, but is just enough to identify a good model from the many candidate
local minima provided by the latter.

Figure 4.1: Level sets of different functions in a bivariate AR example
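The data-generating setup of this bivariate example can be imitated with the sketch below. It is not the code behind Figure 4.1: the random seed, the choice to start the sequence from the stationary distribution, and the function names are assumptions, so the resulting numbers will differ from those in the figure.

```python
import numpy as np

def stationary_cov(A, sigma2):
    """Covariance Q solving A^T Q A + sigma2 * I = Q."""
    p = A.shape[0]
    q = np.linalg.solve(np.eye(p * p) - np.kron(A.T, A.T),
                        sigma2 * np.eye(p).ravel())
    return q.reshape(p, p)

def lyapunov_penalty(A, Q_hat, sigma2):
    """Squared Frobenius norm in (4.4)."""
    M = A.T @ Q_hat @ A + sigma2 * np.eye(A.shape[0]) - Q_hat
    return float(np.sum(M ** 2))

rng = np.random.default_rng(0)  # arbitrary seed
A_true = np.array([[-0.4280, 0.5723], [-1.0428, -0.7144]])
Q = stationary_cov(A_true, 1.0)

# Non-sequence sample: 20 independent draws from the stationary distribution.
Z = rng.multivariate_normal(np.zeros(2), Q, size=20)
Q_hat = np.cov(Z, rowvar=False)

# Sequence of 4 points, x^(t+1) = x^(t) A + noise, started at stationarity.
x = [rng.multivariate_normal(np.zeros(2), Q)]
for _ in range(3):
    x.append(x[-1] @ A_true + rng.standard_normal(2))
x = np.array(x)

print(lyapunov_penalty(A_true, Q, 1.0))      # exact Q: zero up to round-off
print(lyapunov_penalty(A_true, Q_hat, 1.0))  # estimated Q: generally positive
```

With the exact covariance the penalty vanishes at the true model, whereas with a covariance estimated from only 20 points it does not, which is one reason the penalty is combined with the sequence-based terms rather than used alone.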

In  the  following  we  describe  our  proposed  methods  for  incorporating  the  Lyapunov  penalty 
(4.4)  into  ridge  and  sparse  least-square  estimation.  We  also  discuss  robust  estimation  for  the 
covariance  Q. 

4.1  Ridge  and  Lyapunov  penalty 

Here we estimate A by solving the following problem:

$$\min_A \frac{1}{2}\|Y - XA\|_F^2 + \frac{\lambda_1}{2}\|A\|_F^2 + \frac{\lambda_2}{4}\|A^\top \hat{Q} A + \sigma^2 I - \hat{Q}\|_F^2, \tag{4.6}$$

where $\hat{Q}$ is a covariance estimate obtained from the non-sequence sample. We treat $\lambda_1$, $\lambda_2$ and $\sigma^2$
as hyperparameters and determine their values on a validation set. Given these hyperparameters,
we solve (4.6) by gradient descent with back-tracking line search for the step size. The gradient
of the objective function is given by

$$-X^\top Y + X^\top X A + \lambda_1 A + \lambda_2 \hat{Q} A (A^\top \hat{Q} A + \sigma^2 I - \hat{Q}). \tag{4.7}$$

As mentioned before, (4.6) is a non-convex problem and thus requires good initialization. We
use the following two initial estimates of A:

$$\hat{A}^{lsq} := (X^\top X)^\dagger X^\top Y \quad \text{and} \quad \hat{A}^{ridge} := (X^\top X + \lambda_1 I)^{-1} X^\top Y, \tag{4.8}$$

where $(\cdot)^\dagger$ denotes the Moore-Penrose pseudo inverse of a matrix, making $\hat{A}^{lsq}$ the minimum-norm
solution to the least square problem (4.3). We run the gradient descent algorithm with these
two initial estimates, and choose the estimated A that gives a smaller objective.
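The procedure in this section can be sketched as follows: the objective (4.6), its gradient (4.7), a plain gradient descent with back-tracking line search, and the two initial estimates (4.8). The line-search constants, iteration limit, and stopping tolerance are illustrative choices rather than values taken from the text, and the covariance estimate is assumed to be symmetric.

```python
import numpy as np

def objective(A, X, Y, Q, lam1, lam2, sigma2):
    """Objective (4.6); Q is the (symmetric) covariance estimate."""
    M = A.T @ Q @ A + sigma2 * np.eye(A.shape[0]) - Q
    return (0.5 * np.sum((Y - X @ A) ** 2)
            + 0.5 * lam1 * np.sum(A ** 2)
            + 0.25 * lam2 * np.sum(M ** 2))

def gradient(A, X, Y, Q, lam1, lam2, sigma2):
    """Gradient (4.7) of the objective with respect to A."""
    M = A.T @ Q @ A + sigma2 * np.eye(A.shape[0]) - Q
    return -X.T @ Y + X.T @ X @ A + lam1 * A + lam2 * Q @ A @ M

def gradient_descent(A0, args, shrink=0.5, c=1e-4, iters=500, tol=1e-8):
    """Gradient descent with back-tracking (sufficient-decrease) line search."""
    A = A0.copy()
    for _ in range(iters):
        g = gradient(A, *args)
        g2 = np.sum(g ** 2)
        if g2 < tol ** 2:
            break
        f, step = objective(A, *args), 1.0
        # Shrink the step until the objective decreases sufficiently.
        while objective(A - step * g, *args) > f - c * step * g2:
            step *= shrink
            if step < 1e-16:
                return A
        A = A - step * g
    return A

def fit(X, Y, Q, lam1, lam2, sigma2):
    """Run from both initial estimates in (4.8) and keep the better result."""
    args = (X, Y, Q, lam1, lam2, sigma2)
    p = X.shape[1]
    a_lsq = np.linalg.pinv(X.T @ X) @ X.T @ Y
    a_ridge = np.linalg.solve(X.T @ X + lam1 * np.eye(p), X.T @ Y)
    fits = [gradient_descent(a0, args) for a0 in (a_lsq, a_ridge)]
    return min(fits, key=lambda a: objective(a, *args))
```

Because each accepted step lowers the objective, the returned estimate is never worse, in terms of (4.6), than either of its starting points; it is still only a local solution of a non-convex problem.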
4.2 Sparse and Lyapunov penalty

Sparse learning for vector auto-regressive models has become a useful tool in many modern
time series modeling tasks, where the number p of states in the system is usually larger than
the length T of the time series. For example, an important problem in computational biology is
to understand the progression of certain biological processes from some measurements, such as
temporal gene expression data.

Using an idea similar to (4.6), we estimate A by

$$\min_A \frac{1}{2}\|Y - XA\|_F^2 + \frac{\lambda_2}{4}\|A^\top \hat{Q} A + \sigma^2 I - \hat{Q}\|_F^2 \quad \text{s.t.} \quad \|A\|_1 \le \lambda_1. \tag{4.9}$$

Instead of adding a sparse penalty on A to the objective function, we impose a constraint on
the ℓ1 norm of A. Both the penalty and the constraint formulations have been considered in the
sparse  learning  literature,  and  shown  to  be  equivalent  in  the  case  of  a  convex  objective.  Here 
we  choose  the  constraint  formulation  because  it  can  be  solved  by  a  simple  projected  gradient 
descent  method.  On  the  contrary,  the  penalty  formulation  leads  to  a  non-smooth  and  non-convex 
optimization  problem,  which  is  difficult  to  solve  with  standard  methods  for  sparse  learning.  In 
particular,  the  soft-thresholding-based  coordinate  descent  method  for  LASSO  does  not  apply  due 
to  the  Lyapunov  regularization  term.  Moreover,  most  of  the  common  methods  for  non-smooth 
optimization,  such  as  bundle  methods,  solve  convex  problems  and  need  non-trivial  modification 
in  order  to  handle  non-convex  problems  [Noll  et  al.,  2008]. 

Let J(A) denote the objective function in (4.9) and $A^{(k)}$ denote the intermediate solution at
the k-th iteration. Our projected gradient method updates $A^{(k)}$ to $A^{(k+1)}$ by the following rule:

$$A^{(k+1)} \leftarrow \Pi\bigl(A^{(k)} - \eta^{(k)} \nabla J(A^{(k)})\bigr), \tag{4.10}$$

where $\eta^{(k)} > 0$ denotes a proper step size, $\nabla J(A^{(k)})$ denotes the gradient of $J(\cdot)$ at $A^{(k)}$, and
$\Pi(\cdot)$ denotes the projection onto the feasible region $\|A\|_1 \le \lambda_1$. More precisely, for any p-by-p
real matrix V we define

$$\Pi(V) := \arg\min_{\|A\|_1 \le \lambda_1} \|A - V\|_F^2. \tag{4.11}$$

To compute the projection, we use the efficient ℓ1 projection technique outlined in Algorithm 3.6
of Chapter 3.

For choosing a proper step size $\eta^{(k)}$, we consider the simple and effective Armijo rule along
the projection arc described by Bertsekas [1999]. This procedure is given in Algorithm 4.1,
and the main idea is to ensure a sufficient decrease in the objective value per iteration (4.13).
Bertsekas [1999] proved that there always exists $\eta^{(k)} = \beta^{r_k} > 0$ satisfying (4.13), and every
limit point of $\{A^{(k)}\}$ is a stationary point of (4.9). In our experiments we set c = 0.01 and
β = 0.1, both of which are typical values used in gradient descent. As in the previous section,
we need good initializations for the projected gradient descent method. Here we use these two
initial estimates:

$$\tilde{A}^{lsq} := \arg\min_{\|A\|_1 \le \lambda_1} \|A - \hat{A}^{lsq}\|_F^2 \quad \text{and} \quad \hat{A}^{sp} := \arg\min_{\|A\|_1 \le \lambda_1} \frac{1}{2}\|Y - XA\|_F^2, \tag{4.12}$$

where $\hat{A}^{lsq}$ is defined in (4.8), and then choose the one leading to a smaller objective value.
Algorithm 4.1 Armijo's rule along the projection arc

Input: $A^{(k)}$, $\nabla J(A^{(k)})$, $0 < \beta < 1$, $0 < c < 1$

Output: $A^{(k+1)}$

1: Find $\eta^{(k)} = \max\{\beta^{r_k} \mid r_k \in \{0, 1, \ldots\}\}$ such that $A^{(k+1)} := \Pi\big(A^{(k)} - \eta^{(k)} \nabla J(A^{(k)})\big)$ satisfies

\[
J(A^{(k+1)}) - J(A^{(k)}) \le c \, \mathrm{Tr}\big(\nabla J(A^{(k)})^\top (A^{(k+1)} - A^{(k)})\big) \tag{4.13}
\]
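The projected gradient update (4.10) with the Armijo rule along the projection arc can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the thesis: the matrix is handled as a flat list of entries, the $\ell_1$ projection is a standard sort-based routine (which may differ from Algorithm 3.6 of Chapter 3), the backtracking is capped at a hypothetical `max_backtracks`, and the toy objective is a simple quadratic rather than $J(A)$ from (4.9). All function names are illustrative.

```python
def project_l1_ball(v, radius):
    """Euclidean projection of a flat list v onto {a : sum_i |a_i| <= radius}.

    Standard sort-based routine; assumes radius > 0.
    """
    if sum(abs(x) for x in v) <= radius:
        return list(v)
    # Find the soft-threshold level theta from the sorted magnitudes.
    mags = sorted((abs(x) for x in v), reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, m in enumerate(mags, start=1):
        cumsum += m
        t = (cumsum - radius) / i
        if m - t > 0:
            theta = t
    return [max(abs(x) - theta, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def armijo_step(a, grad, objective, radius, beta=0.1, c=0.01, max_backtracks=50):
    """One update of (4.10): try step sizes 1, beta, beta^2, ... until (4.13) holds."""
    j_old = objective(a)
    eta = 1.0
    for _ in range(max_backtracks):
        a_new = project_l1_ball([ai - eta * gi for ai, gi in zip(a, grad)], radius)
        # Inner product <grad, a_new - a>, the flat-vector form of the trace in (4.13).
        inner = sum(gi * (ni - ai) for gi, ni, ai in zip(grad, a_new, a))
        if objective(a_new) - j_old <= c * inner:
            return a_new
        eta *= beta
    return list(a)  # no acceptable step found within the cap; keep the iterate


def projected_gradient(a0, objective, gradient, radius, num_iters=20):
    a = list(a0)
    for _ in range(num_iters):
        a = armijo_step(a, gradient(a), objective, radius)
    return a


# Toy problem (hypothetical): minimize 0.5 * ||a - target||^2 subject to ||a||_1 <= 1.
target = [2.0, 0.5, -1.0]
objective = lambda a: 0.5 * sum((ai - ti) ** 2 for ai, ti in zip(a, target))
gradient = lambda a: [ai - ti for ai, ti in zip(a, target)]
solution = projected_gradient([0.0, 0.0, 0.0], objective, gradient, radius=1.0)
```

For this toy objective the constrained minimizer is simply the projection of `target` onto the unit $\ell_1$ ball, which makes the result easy to check by hand.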


4.3  Robust  estimation  of  covariance  matrices 


To obtain a good estimator for $A$ using the proposed methods, we need a good estimator for the covariance of the stationary distribution of (4.1). Given an independent sample $\{z_i\}_{i=1}^n$ drawn from the stationary distribution, the sample covariance is defined as

\[
S := \frac{1}{n-1} \sum_{i=1}^{n} (z_i - \bar{z})^\top (z_i - \bar{z}), \quad \text{where } \bar{z} := \frac{1}{n} \sum_{i=1}^{n} z_i. \tag{4.14}
\]


Although unbiased, the sample covariance is known to be vulnerable to outliers, and ill-conditioned when the number of sample points $n$ is smaller than the dimension $p$. Both issues arise in many real world problems, and the latter is particularly common in gene expression analysis. Therefore, researchers in many fields, such as statistics [Ledoit and Wolf, 2004; Stein, 1975; Yang and Berger, 1994], finance [Ledoit and Wolf, 2003], signal processing [Chen et al., 2010a,b], and recently computational biology [Schafer and Strimmer, 2005], have investigated robust estimators of covariances. Most of these results originate from the idea of shrinkage estimators, which shrink the covariance matrix towards some target covariance with a simple structure, such as a diagonal matrix. It has been shown by, e.g., Ledoit and Wolf [2003]; Stein [1975] that shrinking the sample covariance can achieve a smaller mean-squared error (MSE). More specifically, Ledoit and Wolf [2003] consider the following linear shrinkage:

\[
\hat{Q} = (1 - \alpha) S + \alpha F \tag{4.15}
\]

for $0 \le \alpha \le 1$ and some target covariance $F$, and derive a formula for the optimal $\alpha$ that minimizes the mean-squared error:

\[
\alpha^* := \arg\min_{0 \le \alpha \le 1} \mathbb{E}\big(\|\hat{Q} - Q\|_F^2\big), \tag{4.16}
\]

which involves unknown quantities such as true covariances of $S$. Schafer and Strimmer [2005] propose to estimate $\alpha^*$ by replacing all the population quantities appearing in $\alpha^*$ by their unbiased empirical estimates, and derive the resulting estimator $\hat{\alpha}^*$ for several types of target $F$. For the experiments later in this chapter we use the estimator proposed by Schafer and Strimmer [2005] with the following $F$:

\[
F_{ij} = \begin{cases} S_{ij}, & \text{if } i = j, \\ 0, & \text{otherwise,} \end{cases} \qquad 1 \le i, j \le p. \tag{4.17}
\]
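Denoting the sample correlation matrix as $R$, we give below the final estimator $\hat{Q}$ [Table 1, Schafer and Strimmer, 2005]:

\[
\hat{Q}_{ij} := \begin{cases} S_{ii}, & \text{if } i = j, \\ \hat{R}_{ij} \sqrt{S_{ii} S_{jj}}, & \text{otherwise,} \end{cases}
\qquad
\hat{R}_{ij} := \begin{cases} 1, & \text{if } i = j, \\ R_{ij} \min(1, \max(0, 1 - \hat{\alpha}^*)), & \text{otherwise,} \end{cases} \tag{4.18}
\]

\[
\hat{\alpha}^* := \frac{\sum_{i \ne j} \widehat{\mathrm{Var}}(R_{ij})}{\sum_{i \ne j} R_{ij}^2},
\qquad
\widehat{\mathrm{Var}}(R_{ij}) := \frac{n}{(n-1)^3} \sum_{k=1}^{n} (w_{kij} - \bar{w}_{ij})^2, \tag{4.19}
\]

where

\[
w_{kij} := (\tilde{z}_k)_i (\tilde{z}_k)_j, \qquad \bar{w}_{ij} := \frac{1}{n} \sum_{k=1}^{n} w_{kij}, \tag{4.20}
\]

and $\{\tilde{z}_i\}_{i=1}^n$ are standardized non-sequence samples.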
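As an illustration of the shrinkage estimator with the diagonal target (4.17), the following sketch computes a shrinkage intensity and the resulting covariance estimate in plain Python. It follows the diagonal-target estimator of Schafer and Strimmer [2005] as commonly stated, which is an assumption about how (4.18) to (4.20) read in the original typesetting; the function name `shrinkage_covariance` and the toy sample are hypothetical, and every coordinate is assumed to have non-zero sample variance.

```python
import math


def shrinkage_covariance(samples):
    """samples: list of n rows, each a list of p floats. Returns (Q_hat, alpha_hat)."""
    n, p = len(samples), len(samples[0])
    mean = [sum(row[i] for row in samples) / n for i in range(p)]
    # Unbiased sample variances S_ii (assumed non-zero).
    var = [sum((row[i] - mean[i]) ** 2 for row in samples) / (n - 1) for i in range(p)]
    # Standardize each coordinate to zero mean and unit sample variance.
    z = [[(row[i] - mean[i]) / math.sqrt(var[i]) for i in range(p)] for row in samples]

    corr = [[1.0] * p for _ in range(p)]
    num = den = 0.0
    for i in range(p):
        for j in range(p):
            if i == j:
                continue
            w = [zk[i] * zk[j] for zk in z]          # w_kij
            w_bar = sum(w) / n                       # mean of w_kij over k
            r_ij = n / (n - 1) * w_bar               # sample correlation R_ij
            var_r = n / (n - 1) ** 3 * sum((wk - w_bar) ** 2 for wk in w)
            corr[i][j] = r_ij
            num += var_r
            den += r_ij ** 2

    alpha = num / den if den > 0 else 1.0            # estimated shrinkage intensity
    keep = min(1.0, max(0.0, 1.0 - alpha))           # factor applied to off-diagonals
    q_hat = [[var[i] if i == j else corr[i][j] * keep * math.sqrt(var[i] * var[j])
              for j in range(p)] for i in range(p)]
    return q_hat, alpha


# Toy sample (hypothetical): two perfectly correlated coordinates, n = 4.
toy_sample = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
q_hat, alpha_hat = shrinkage_covariance(toy_sample)
```

The diagonal of the estimate is left untouched, while off-diagonal correlations are scaled towards zero by the factor `keep`, so the estimate stays well-conditioned even when $n$ is small.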


4.4  Experiments 


To  evaluate  the  proposed  methods,  we  conduct  experiments  on  synthetic  and  video  data.  We  use 
the same performance metrics as in Chapter 3 for evaluating a learnt model $\hat{A}$:

\[
\text{Normalized error:} \quad \frac{1}{T-1} \sum_{t=1}^{T-1} \frac{\|x_{t+1} - x_t \hat{A}\|}{\|x_{t+1} - x_t\|},
\]

\[
\text{Cosine score:} \quad \frac{1}{T-1} \sum_{t=1}^{T-1} \frac{(x_{t+1} - x_t)^\top (x_t \hat{A} - x_t)}{\|x_{t+1} - x_t\| \, \|x_t \hat{A} - x_t\|}.
\]

In experiments on synthetic data we have the true transition matrix $A$, so we consider a third criterion, the matrix error: $\|\hat{A} - A\|_F / \|A\|_F$.
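The two sequence-based metrics can be sketched directly from their definitions. The snippet below is an illustration only: it assumes the row-vector convention $x_{t+1} \approx x_t \hat{A}$ used in this chapter, represents states as plain lists, and uses hypothetical helper names; it does not guard against zero denominators.

```python
import math


def _vec_mat(x, a):
    """Row vector times matrix: returns x A."""
    return [sum(x[i] * a[i][j] for i in range(len(x))) for j in range(len(a[0]))]


def _norm(v):
    return math.sqrt(sum(c * c for c in v))


def normalized_error(xs, a_hat):
    """Average of ||x_{t+1} - x_t A|| / ||x_{t+1} - x_t|| over consecutive pairs."""
    total = 0.0
    for t in range(len(xs) - 1):
        pred = _vec_mat(xs[t], a_hat)
        num = _norm([n - p for n, p in zip(xs[t + 1], pred)])
        den = _norm([n - c for n, c in zip(xs[t + 1], xs[t])])
        total += num / den
    return total / (len(xs) - 1)


def cosine_score(xs, a_hat):
    """Average cosine between the true step and the predicted step."""
    total = 0.0
    for t in range(len(xs) - 1):
        true_step = [n - c for n, c in zip(xs[t + 1], xs[t])]
        pred_step = [p - c for p, c in zip(_vec_mat(xs[t], a_hat), xs[t])]
        dot = sum(u * v for u, v in zip(true_step, pred_step))
        total += dot / (_norm(true_step) * _norm(pred_step))
    return total / (len(xs) - 1)


# Toy check (hypothetical): a noiseless sequence generated by A = 0.5 * I.
a_true = [[0.5, 0.0], [0.0, 0.5]]
sequence = [[4.0, 2.0], [2.0, 1.0], [1.0, 0.5]]
err = normalized_error(sequence, a_true)   # 0 for a perfect one-step predictor
cos = cosine_score(sequence, a_true)       # 1 when predicted and true steps align
```

A normalized error below 1 means the model predicts the next state better than the trivial "no change" predictor, which is why the denominator uses $x_{t+1} - x_t$.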

In all our experiments, we have a training sequence, a testing sequence, and a non-sequence sample. To choose the hyper-parameters $\lambda_1$, $\lambda_2$ and $\sigma^2$, we split the training sequence into two halves and use the second half as the validation sequence. Once we find the best hyper-parameters according to the validation performance, we train a model on the full training sequence and predict on the testing sequence. For $\lambda_1$ and $\lambda_2$, we adopt the usual grid-search scheme with a suitable range of values. For $\sigma^2$, we observe that (4.2) implies $Q - \sigma^2 I$ should be positive semidefinite, and thus search the set $\{0.9^j \min_i \lambda_i(\hat{Q}) \mid 1 \le j \le 3\}$. In most of our experiments, we find that the proposed methods are much less sensitive to $\sigma^2$ than to $\lambda_1$ and $\lambda_2$.


4.4.1  Synthetic  Data 

We consider the following two VAR models with Gaussian noise $\epsilon_t \sim \mathcal{N}(0, I)$:

\[
\text{Dense Model:} \quad A = \frac{0.95\, M}{\max_i |\lambda_i(M)|}, \quad M_{ij} \sim \mathcal{N}(0, 1), \quad 1 \le i, j \le 200,
\]

\[
\text{Sparse Model:} \quad A = \frac{0.95\, (M \odot B)}{\max_i |\lambda_i(M \odot B)|}, \quad M_{ij} \sim \mathcal{N}(0, 1), \quad B_{ij} \sim \mathrm{Bern}(1/8), \quad 1 \le i, j \le 200.
\]

Figure 4.2: Testing performances and eigenvalues in modulus for the dense model

Here $\mathrm{Bern}(h)$ is the Bernoulli distribution with success probability $h$, and $\odot$ denotes the entry-wise product of two matrices. By setting $h = 1/8$, we make the sparse transition matrix $A$ have roughly $40000/8 = 5000$ non-zero entries. Both models are stable, and the stationary distribution for each model is a zero-mean Gaussian. We obtain the covariance $Q$ of each stationary distribution by solving the Lyapunov equation (4.2). For a single experiment, we generate a training sequence and a testing sequence, both initialized from the stationary distribution, and draw a non-sequence sample independently from the stationary distribution. We set the length of the testing sequence to be 800, and vary the training sequence length $T$ and the non-sequence sample size $n$: for the dense model, $T \in \{50, 100, 150, 200, 300, 400, 600, 800\}$ and $n \in \{50, 400, 1600\}$; for the sparse model, $T \in \{25, 75, 150, 400\}$ and $n \in \{50, 400, 1600\}$. Under each combination of $T$ and $n$, we compare the proposed Lyapunov penalization method with the baseline approach of penalized least square, which uses only the sequence data. To investigate the limit of the proposed methods, we also use the true $Q$ for the Lyapunov penalization. We run 10 such experiments for the dense model and 5 for the sparse model, and report the overall performances of both the proposed and the baseline methods.


Experimental  results  for  the  dense  model 

We  give  boxplots  of  the  three  performance  measures  in  the  10  experiments  in  Figures  4.2(a)  to 
4.2(c).  The  ridge  regression  approach  and  the  proposed  Lyapunov  penalization  method  (4.6)  are 
abbreviated  as  Ridge  and  Lyap,  respectively.  For  normalized  error  and  cosine  score,  we  also 
report  the  performance  of  the  true  A  on  testing  sequences. 

We observe that Lyap improves over Ridge more significantly when the training sequence length $T$ is small ($\le 200$) and the non-sequence sample size $n$ is large ($\ge 400$). When $T$ is large, Ridge already performs quite well and Lyap does not improve the performance much. But with the true stationary covariance $Q$, Lyap outperforms Ridge significantly for all $T$. When $n$ is small, the covariance estimate $\hat{Q}$ is far from the true $Q$ and the Lyapunov penalty does not provide useful information about $A$. In this case, the value of $\lambda_2$ determined by the validation performance is usually quite small (0.5 or 1) compared to $\lambda_1$ (256), so the two methods perform similarly on testing sequences. We note that if instead of the robust covariance estimate in (4.18) and (4.19) we use the sample covariance, the performance of Lyap can be marginally worse than Ridge when $n$ is small. A precise statement on how the estimation error in $\hat{Q}$ affects $\hat{A}$ is worth studying in the future. As a qualitative assessment of the estimated transition matrices, in Figure 4.2(d) we plot the eigenvalues in modulus of the true $A$ and the $\hat{A}$'s obtained by different methods when $T = 50$ and $n = 1600$. The eigenvalues are sorted according to their modulus. Both Ridge and Lyap severely under-estimate the eigenvalues in modulus, but Lyap preserves the spectrum much better than Ridge.

Figure 4.3: Testing performances and eigenvalues in modulus for the sparse model

Experimental  results  for  the  sparse  model 

We give boxplots of the performance measures in the 5 experiments in Figures 4.3(a) to 4.3(c), and the eigenvalues in modulus of the true $A$ and some $\hat{A}$'s in Figure 4.3(d). The sparse least-square method and the proposed method (4.9) are abbreviated as Sparse and Lyap, respectively.

We observe the same type of improvement as in the dense model: Lyap improves over Sparse more significantly when $T$ is small and $n$ is large. But the largest improvement occurs when $T = 75$, not at the shortest training sequence length $T = 25$. A major difference lies in the impact of the Lyapunov penalization on the spectrum of $\hat{A}$, as revealed in Figure 4.3(d). When $T$ is as small as 25, the sparse least-square method shrinks all the eigenvalues but still keeps most of them non-zero, while Lyap with a non-sequence sample of size 1600 over-estimates the first few largest eigenvalues in modulus but shrinks the rest to have very small modulus. In contrast, Lyap with the true $Q$ preserves the spectrum much better. We may thus need an even better covariance estimate for the sparse model.


4.4.2  Video  Data 

We test our methods using a video sequence of a periodically swinging pendulum, which is cut from the video used in Chapter 3 and consists of 500 frames of 75-by-80 grayscale images. One such frame is given in Figure 4.4(a). The period is about 23 frames. To further reduce the dimension we take the second-level Gaussian pyramids, resulting in images of size 9-by-11. We then treat each reduced image as a 99-dimensional vector, and normalize each dimension to have zero mean and standard deviation 1. We analyze this sequence with a 99-dimensional first-order VAR model. To check whether a VAR model is a suitable choice, we estimate a transition matrix from the first 400 frames by ridge regression while choosing the penalization parameter on the next 50 frames, and predict on the last 50 frames. The best penalization parameter is 0.0156, and the testing normalized error and cosine score are 0.33 and 0.97, respectively, suggesting that the dynamics of the video sequence are well captured by a VAR model.

Figure 4.4: Results on the pendulum video data

We compare the proposed method (4.6) with ridge regression for four lengths of the training sequence, $T \in \{6, 10, 20, 50\}$, and treat the last 50 frames as the testing sequence. For both methods, we split the training sequence into two halves and use the second half as a validation sequence. For the proposed method, we simulate a non-sequence sample by randomly choosing 300 frames from between the $(T+1)$-st frame and the 450-th frame without replacement. We repeat this 10 times.

The testing normalized errors and cosine scores of both methods are given in Figures 4.4(b) and 4.4(c). For the proposed method, we report the mean performance measures over the 10 simulated non-sequence samples with standard deviation. When $T \le 20$, which is close to the period, the proposed method outperforms ridge regression very significantly, except that when $T = 10$ the cosine score of Lyap is barely better than that of Ridge. However, when we increase $T$ to 50, the difference between the two methods vanishes, even though there is still much room for improvement as indicated by the result of our earlier model sanity check. This may be due to our use of dependent data as the non-sequence sample, or simply to insufficient non-sequence data. As for $\lambda_1$ and $\lambda_2$, their values decrease respectively from 512 and 2,048 to less than 32 as $T$ increases, but since we fix the amount of non-sequence data, the interaction between their value changes is less clear than on the synthetic data.


4.5  Conclusion 

In this chapter we propose to improve penalized least-square estimation of VAR models by incorporating non-sequence data independently drawn from the stationary distribution of the underlying VAR model. We construct a novel penalization term based on the discrete-time Lyapunov equation incorporating the covariance (estimate) of the stationary distribution. Although the resulting optimization problems are non-convex, the standard least-square solution obtained by using only sequence data often serves as a good initial point, reducing the need for multiple random initializations. Experimental results demonstrate that our methods can improve significantly over standard penalized least-square methods when there are only few sequence data but abundant non-sequence data and when the model assumption is valid. Future directions include investigating the impact of $\hat{Q}$ on $\hat{A}$ in a precise manner, generalizing the proposed Lyapunov penalization scheme to handle general noise covariances, and applying the proposed methods to other real-world data.
Chapter 5

Learning  Hidden  Markov  Models  from 
Non-sequence  Data 


In this and the next chapters we turn to learning hidden Markov models (HMMs) from non-sequence data. At first glance this seems to be an unthinkable attempt because, as shown in Chapter 3, it is not clear how to deal with the various identifiability issues that can seriously compromise learning of the fully observable VAR model, let alone the more complicated HMM, whose estimation is challenging even in the usual sequential learning setting. One major hurdle lies in the use of the EM learning paradigm, which often casts learning as a highly non-convex optimization problem due to hidden variables and consequently suffers from multiple local optima with no guarantee. Moreover, the EM approach usually does not shed much light on ways to reduce the ambiguity of the learning problem without making strong assumptions, because as long as the resulting optimization problem remains non-convex, formal analysis of learning guarantees is still formidable.

We thus propose to take a different approach, based on another long-standing estimation principle: the method of moments (MoM). The basic idea of MoM is to find model parameters such that the resulting moments match or resemble the empirical moments. For some estimation problems, this approach is able to give unique and consistent estimates while the maximum-likelihood method gets entangled in multiple and potentially undesirable local maxima. Taking advantage of this property, an emerging area of research in machine learning has recently developed MoM-based learning algorithms with formal guarantees for some widely used latent variable models, such as Gaussian Mixture Models [Hsu and Kakade, 2013], Hidden Markov Models [Anandkumar et al., 2012b], Latent Dirichlet Allocation [Anandkumar et al., 2013; Arora et al., 2012], etc. Although many learning algorithms for these models exist, some having been very successful in practice, barely any formal learning guarantee was given until the MoM-based methods were proposed. Such breakthroughs seem surprising, but it turns out that they are mostly based on one crucial property: for quite a few latent variable models, the model parameters can be uniquely determined from spectral decompositions of certain low-order moments of observable quantities.

In this chapter we demonstrate that under the MoM and spectral learning framework, there are reasonable assumptions on the generative process of non-sequence data, under which the tensor decomposition method [Anandkumar et al., 2012a], a recent advancement in spectral learning, can provably recover the parameters of certain types of first-order Markov models and hidden Markov models. To the best of our knowledge, ours is the first work that provides formal guarantees for learning from non-sequence data in terms of parameter estimation accuracy. Interestingly, these assumptions bear much similarity to the usual idea behind topic modeling: with the bag-of-words representation, which is invariant to word orderings, the task of inferring topics is almost impossible given one single document (no matter how long it is!), but becomes easier as more documents touching upon various topics become available. For learning dynamic models, what we need in the non-sequence data are multiple sets of observations, where each set contains independent samples generated from its own initial distribution, and the many different initial distributions together cover the entire (hidden) state space. In some of the scientific applications described in Chapter 1, such as biological studies, this type of assumption might be realized by running multiple experiments with different initial configurations or amounts of stimuli.

This chapter consists of four sections. Section 5.1 reviews the essentials of the tensor decomposition framework [Anandkumar et al., 2012a]; Section 5.2 details our assumptions on non-sequence data, tensor-decomposition based learning algorithms, and theoretical guarantees; Section 5.3 reports some simulation results confirming our theoretical findings, followed by conclusions in Section 5.4. Proofs of theoretical results are given in Appendix B.


5.1  Tensor  Decomposition 

We briefly introduce the tensor decomposition framework [Anandkumar et al., 2012a], mainly following their exposition and describing only the components necessary for our work. We first give some preliminaries and notations. A real $p$-th order tensor $A$ is a member of the tensor product space $\bigotimes_{i=1}^{p} \mathbb{R}^{m_i}$ of $p$ Euclidean spaces. For a vector $x \in \mathbb{R}^m$, we denote by $x^{\otimes p} := x \otimes x \otimes \cdots \otimes x \in \bigotimes_{i=1}^{p} \mathbb{R}^m$ its $p$-th tensor power. A convenient way to represent $A \in \bigotimes_{i=1}^{p} \mathbb{R}^m$ is through a $p$-way array of real numbers $[A_{i_1 i_2 \ldots i_p}]_{1 \le i_1, i_2, \ldots, i_p \le m}$, where $A_{i_1 i_2 \ldots i_p}$ denotes the $(i_1, i_2, \ldots, i_p)$-th coordinate of $A$ with respect to a canonical basis. With this representation, we can view $A$ as a multi-linear map that, given a set of $p$ matrices $\{X_i \in \mathbb{R}^{m \times m_i}\}_{i=1}^{p}$, produces another $p$-th order tensor $A(X_1, X_2, \ldots, X_p) \in \bigotimes_{i=1}^{p} \mathbb{R}^{m_i}$ with the following $p$-way array representation:

\[
A(X_1, X_2, \ldots, X_p)_{i_1 i_2 \ldots i_p} := \sum_{1 \le j_1, j_2, \ldots, j_p \le m} A_{j_1 j_2 \ldots j_p} (X_1)_{j_1 i_1} (X_2)_{j_2 i_2} \cdots (X_p)_{j_p i_p}. \tag{5.1}
\]
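To make the multi-linear map (5.1) concrete, here is a minimal sketch for the third-order case $p = 3$, with tensors stored as nested Python lists. The function name `multilinear_map` and the toy inputs are hypothetical; the loops follow the index pattern of (5.1) literally and are not meant to be efficient.

```python
def multilinear_map(tensor, x1, x2, x3):
    """Compute A(X1, X2, X3) for an m x m x m tensor and matrices X_i of shape m x m_i."""
    m = len(tensor)
    n1, n2, n3 = len(x1[0]), len(x2[0]), len(x3[0])
    out = [[[0.0] * n3 for _ in range(n2)] for _ in range(n1)]
    for j1 in range(m):
        for j2 in range(m):
            for j3 in range(m):
                a = tensor[j1][j2][j3]
                if a == 0.0:
                    continue  # skip zero coordinates
                for i1 in range(n1):
                    for i2 in range(n2):
                        for i3 in range(n3):
                            out[i1][i2][i3] += a * x1[j1][i1] * x2[j2][i2] * x3[j3][i3]
    return out


# Toy input (hypothetical): the rank-one tensor x (x) x (x) x with x = (1, 2).
x = [1.0, 2.0]
rank_one = [[[x[i] * x[j] * x[l] for l in range(2)] for j in range(2)] for i in range(2)]

# With three copies of a single column theta, the map returns the scalar (x . theta)^3.
theta_col = [[1.0], [1.0]]
cubic = multilinear_map(rank_one, theta_col, theta_col, theta_col)

# With identity matrices the tensor is returned unchanged.
identity = [[1.0, 0.0], [0.0, 1.0]]
same = multilinear_map(rank_one, identity, identity, identity)
```

Passing the identity in the first slot and a single column in the other two yields the vector-valued contraction $T(I, \theta, \theta)$ used by tensor power iteration later in this section.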

In this work we consider tensors that are up to the third-order ($p \le 3$) and, for most of the time, also symmetric, meaning that their $p$-way array representations are invariant under permutations of array indices. More specifically, we focus on second and third-order symmetric tensors in or slightly perturbed from the following form:

\[
M_2 := \sum_{i=1}^{k} \omega_i \, \mu_i \otimes \mu_i, \qquad M_3 := \sum_{i=1}^{k} \omega_i \, \mu_i \otimes \mu_i \otimes \mu_i, \tag{5.2}
\]

satisfying the following non-degeneracy conditions:

Condition 1. $\omega_i > 0 \;\forall\, 1 \le i \le k$, $\{\mu_i \in \mathbb{R}^m\}_{i=1}^{k}$ are linearly independent, and $k \le m$.
Algorithm 5.1 Whitening transformation

input A symmetric matrix $M_2 \in \mathbb{R}^{m \times m}$, a symmetric third-order tensor $M_3 \in \mathbb{R}^{m \times m \times m}$, and the target dimension $k$.

output A reduced third-order tensor $T \in \mathbb{R}^{k \times k \times k}$ and a whitening transformation $W \in \mathbb{R}^{m \times k}$.

1: Compute $W := U D^{-1/2}$, where $U \in \mathbb{R}^{m \times k}$ denotes the top-$k$ orthonormal eigenvectors of $M_2$, and $D \in \mathbb{R}^{k \times k}$ is a diagonal matrix of the corresponding $k$ positive eigenvalues.

2: Compute $T := M_3(W, W, W)$.


As described in later sections, the core of our learning task is the problem of estimating $\{\omega_i\}_{i=1}^{k}$ and $\{\mu_i\}_{i=1}^{k}$ from perturbed or noisy versions of $M_2$ and $M_3$, which we solve with the tensor decomposition method recently proposed by Anandkumar et al. [2012a], summarized below. Suppose the noiseless $M_2$ and $M_3$ are available. We first perform a whitening step on them, as outlined in Algorithm 5.1, to obtain a whitened, lower-dimensional tensor $T \in \mathbb{R}^{k \times k \times k}$ and a whitening transformation $W \in \mathbb{R}^{m \times k}$ such that

\[
T := M_3(W, W, W) = \sum_{i=1}^{k} \omega_i (W^\top \mu_i)^{\otimes 3} = \sum_{i=1}^{k} \frac{1}{\sqrt{\omega_i}} \tilde{\mu}_i^{\otimes 3},
\]

where the vectors $\tilde{\mu}_i := \sqrt{\omega_i}\, W^\top \mu_i$ form an orthonormal basis of $\mathbb{R}^k$ because $I = W^\top M_2 W = \sum_{i=1}^{k} \tilde{\mu}_i \tilde{\mu}_i^\top$. Hence, the symmetric tensor $T$ has a so-called orthogonal decomposition, which may not exist for general symmetric tensors. Then by Theorem 4.3 of [Anandkumar et al., 2012a], which establishes the following results under Condition 1:

1. the set of robust eigenvectors (cf. Section 4.2 of [Anandkumar et al., 2012a]) of $T$ corresponds exactly to $\{\tilde{\mu}_i\}_{i=1}^{k}$;

2. the eigenvalue associated with $\tilde{\mu}_i$ is equal to $1/\sqrt{\omega_i}$, $\forall\, 1 \le i \le k$;

3. if $(v, \lambda)$ is a pair of robust eigenvector/eigenvalue of $T$, then $\mu_i = \lambda (W^\top)^{\dagger} v$ for some $1 \le i \le k$, where $\dagger$ denotes the Moore-Penrose pseudo inverse;

we can reduce the original problem of recovering the structure in (5.2) into a robust tensor eigen-decomposition problem. Motivated by power iteration for matrix eigen computation, Anandkumar et al. [2012a] verify that starting from almost every vector $\theta_0 \in \mathbb{R}^k$, the tensor power iteration

\[
\theta_t := \frac{T(I, \theta_{t-1}, \theta_{t-1})}{\|T(I, \theta_{t-1}, \theta_{t-1})\|},
\]

where $\|\cdot\|$ denotes the vector 2-norm, converges to some robust eigenvector of $T$ at a quadratic rate, and therefore $k$ successive applications of tensor power iteration with deflation result in all pairs of robust eigenvectors/eigenvalues.

Algorithm 5.2 Robust tensor power method

input A symmetric tensor $\tilde{T} \in \mathbb{R}^{k \times k \times k}$, number of iterations $L$, $N$.

output the estimated eigenvector/eigenvalue pair; the deflated tensor.

1: for $\tau = 1$ to $L$ do

2:   Draw $\theta_0^{(\tau)}$ uniformly at random from the unit sphere in $\mathbb{R}^k$.

3:   for $t = 1$ to $N$ do

4:     $\theta_t^{(\tau)} := \tilde{T}(I, \theta_{t-1}^{(\tau)}, \theta_{t-1}^{(\tau)}) \,/\, \|\tilde{T}(I, \theta_{t-1}^{(\tau)}, \theta_{t-1}^{(\tau)})\|$

5:   end for

6: end for

7: Let $\tau^* := \arg\max_{1 \le \tau \le L} \tilde{T}(\theta_N^{(\tau)}, \theta_N^{(\tau)}, \theta_N^{(\tau)})$.

8: Do $N$ power iteration updates (Line 4) starting from $\theta_N^{(\tau^*)}$ to obtain $\hat{\theta}$, and set $\hat{\lambda} := \tilde{T}(\hat{\theta}, \hat{\theta}, \hat{\theta})$.

9: return the estimated eigenvector/eigenvalue pair $(\hat{\theta}, \hat{\lambda})$; the deflated tensor $\tilde{T} - \hat{\lambda}\, \hat{\theta}^{\otimes 3}$.

In practice we almost never have the exact $M_2$ and $M_3$, but only noisy or perturbed versions $\hat{M}_2$ and $\hat{M}_3$, which are usually estimates from the data. Perturbation may destroy the tensor structure (5.2), so the reduced tensor $\hat{T}$ resulting from applying Algorithm 5.1 to $\hat{M}_2$ and $\hat{M}_3$ may no longer be orthogonally decomposable, hindering the subsequent robust tensor eigendecomposition. Nevertheless, Anandkumar et al. [2012a] demonstrate that if the perturbation $E := \hat{T} - T$ is a symmetric tensor with a small operator norm defined as $\|E\| := \sup_{\|\theta\| = 1} |E(\theta, \theta, \theta)|$, then $k$ successive applications of some randomized tensor power iteration coupled with deflation yield accurate estimates of all robust eigenvector/eigenvalue pairs with high probability. More precisely, they propose the Robust tensor power method outlined in Algorithm 5.2, which employs multiple random restarts, and provide a theoretical guarantee on its robustness against the input perturbation:

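Theorem 1. (Theorem 5.1 of [Anandkumar et al., 2012a]) Let $\hat{T} = T + E \in \mathbb{R}^{k \times k \times k}$, where $T$ is a symmetric tensor with orthogonal decomposition $T = \sum_{i=1}^{k} \lambda_i v_i^{\otimes 3}$ where each $\lambda_i > 0$, $\{v_1, v_2, \ldots, v_k\}$ is an orthonormal basis, and $E$ has operator norm $\epsilon := \|E\|$. Define $\lambda_{\min} := \min(\{\lambda_i\}_{i=1}^{k})$ and $\lambda_{\max} := \max(\{\lambda_i\}_{i=1}^{k})$. There exist universal constants $c_1, c_2, c_3 > 0$ such that the following holds. Pick any $\eta \in (0, 1)$, and suppose

\[
\epsilon \le c_1 \cdot \frac{\lambda_{\min}}{k}, \qquad N \ge c_2 \cdot \big(\log(k) + \log\log(\lambda_{\max}/\epsilon)\big), \quad \text{and}
\]

\[
\sqrt{\frac{\ln(L/\log_2(k/\eta))}{\ln(k)}} \cdot \left(1 - \frac{\ln(\ln(L/\log_2(k/\eta))) + c_3}{4 \ln(L/\log_2(k/\eta))} - \sqrt{\frac{\ln(8)}{\ln(L/\log_2(k/\eta))}}\right) \ge 1.02 \left(1 + \sqrt{\frac{\ln(4)}{\ln(k)}}\right).
\]

(Note that the condition on $L$ holds with $L = \mathrm{poly}(k) \log(1/\eta)$.) Suppose that Algorithm 5.2 is iteratively called $k$ times with numbers of iterations $L$ and $N$, where the input tensor is $\hat{T}$ in the first call, and in each subsequent call, the input tensor is the deflated tensor returned by the previous call. Let $(\hat{v}_1, \hat{\lambda}_1), (\hat{v}_2, \hat{\lambda}_2), \ldots, (\hat{v}_k, \hat{\lambda}_k)$ be the sequence of estimated eigenvector/eigenvalue pairs returned in these $k$ calls. With probability at least $1 - \eta$, there exists a permutation $\rho$ on $\{1, \ldots, k\}$ such that

\[
\|v_{\rho(j)} - \hat{v}_j\| \le 8\epsilon/\lambda_{\rho(j)}, \quad |\lambda_{\rho(j)} - \hat{\lambda}_j| \le 5\epsilon, \quad \forall\, 1 \le j \le k, \quad \text{and} \quad \Big\|T - \sum_{j=1}^{k} \hat{\lambda}_j \hat{v}_j^{\otimes 3}\Big\| \le 55\epsilon.
\]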

This result, together with existing perturbation theory regarding the whitening procedure (e.g., Appendix C.1 of [Anandkumar et al., 2013]), allows us to translate the perturbations in $M_2$ and $M_3$ into the estimation errors in the $\omega_i$'s and $\mu_i$'s, guaranteeing accurate estimation under small input perturbation.
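The steps of Algorithm 5.2 can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation: tensors are nested lists, random starting points are drawn by normalizing Gaussian vectors (one standard way to sample uniformly from the unit sphere), the function names are hypothetical, and the toy input is an exactly orthogonally decomposable tensor with hypothetical eigenvalues 3, 2, 1 along the standard basis, so no perturbation is present.

```python
import math
import random


def contract(tensor, theta):
    """Return the vector T(I, theta, theta) for a k x k x k nested-list tensor."""
    k = len(tensor)
    return [sum(tensor[i][a][b] * theta[a] * theta[b]
                for a in range(k) for b in range(k)) for i in range(k)]


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def power_updates(tensor, theta, num_updates):
    """Apply the tensor power iteration update (Line 4) num_updates times."""
    for _ in range(num_updates):
        theta = normalize(contract(tensor, theta))
    return theta


def cubic_form(tensor, theta):
    """Return the scalar T(theta, theta, theta)."""
    return sum(c * t for c, t in zip(contract(tensor, theta), theta))


def robust_power_call(tensor, num_restarts, num_updates, rng):
    """One call of the robust tensor power method: (eigenvector, eigenvalue, deflated)."""
    k = len(tensor)
    candidates = []
    for _ in range(num_restarts):                       # Lines 1 to 6
        start = normalize([rng.gauss(0.0, 1.0) for _ in range(k)])
        candidates.append(power_updates(tensor, start, num_updates))
    best = max(candidates, key=lambda th: cubic_form(tensor, th))   # Line 7
    theta = power_updates(tensor, best, num_updates)    # Line 8
    lam = cubic_form(tensor, theta)
    deflated = [[[tensor[i][a][b] - lam * theta[i] * theta[a] * theta[b]
                  for b in range(k)] for a in range(k)] for i in range(k)]
    return theta, lam, deflated                         # Line 9


# Toy input (hypothetical): T = 3 e1^(x)3 + 2 e2^(x)3 + 1 e3^(x)3.
true_eigenvalues = [3.0, 2.0, 1.0]
k = 3
toy = [[[true_eigenvalues[i] if i == a == b else 0.0 for b in range(k)]
        for a in range(k)] for i in range(k)]

rng = random.Random(0)
recovered = []
current = toy
for _ in range(k):  # k successive calls, each on the previously deflated tensor
    theta_hat, lam_hat, current = robust_power_call(current, 10, 30, rng)
    recovered.append(lam_hat)
```

The order in which eigenpairs come out depends on the random restarts, which is why the guarantee in Theorem 1 is stated up to a permutation.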
Figure 5.1: Running example of Markov chain with three states


5.2  Learning  from  Non-sequence  Data 

We  first  describe  a  generative  process  of  non-sequence  data  for  first-order  Markov  models  and 
demonstrate  how  to  apply  tensor  decomposition  methods  to  perform  consistent  learning.  Then 
we extend these ideas to hidden Markov models and provide theoretical guarantees on the sample complexity of the proposed learning algorithm. For notational convenience we define the following vector-matrix cross product $\otimes_{d \in \{1,2,3\}}$: $(v \otimes_1 M)_{ijk} := v_i (M)_{jk}$, $(v \otimes_2 M)_{ijk} := v_j (M)_{ik}$, $(v \otimes_3 M)_{ijk} := v_k (M)_{ij}$. For a matrix $M$ we denote by $M_i$ its $i$-th column.


5.2.1  First-order  Markov  Models 

Let $P \in [0, 1]^{m \times m}$ be the transition probability matrix of a discrete, first-order, ergodic Markov chain with $m$ states and a unique stationary distribution $\pi$. Let $P$ be of full rank and $\mathbf{1}^\top P = \mathbf{1}^\top$. To give a high-level idea of what makes it possible to learn $P$ from non-sequence data, we use the simple Markov chain with three states shown in Figure 5.1 as our running example, demonstrating step by step how to extend from a very restrictive generative setting of the data to a reasonably general setting, along with the assumptions made to allow consistent parameter estimation. In the usual setting where we have sequences of observations, say $\{x^{(1)}, x^{(2)}, \ldots\}$ with parenthesized superscripts denoting time, it is straightforward to consistently estimate $P$. We simply calculate the empirical frequency of consecutive pairs of states:

\[
\hat{P}_{ij} := \frac{\sum_t \mathbb{1}(x^{(t+1)} = i, \; x^{(t)} = j)}{\sum_t \mathbb{1}(x^{(t)} = j)}.
\]

Alternatively, suppose for each state $j$, we have an i.i.d. sample of its immediate next state $D_j := \{x_1^{(1)}, x_2^{(1)}, \ldots \mid x^{(0)} = j\}$, where subscripts are data indices. Consistent estimation in this case is also easy: the empirical distribution of $D_j$ consistently estimates $P_j$, the $j$-th column of $P$. For example, the Markov chain in Figure 5.1 may produce the following three samples, whose empirical distributions estimate the three columns of $P$ respectively:

$D_1 = \{2, 1, 2, 2, 2, 2, 2, 2, 2, 2\} \;\Rightarrow\; \hat{P}_1 = [0.1 \;\; 0.9 \;\; 0.0]^\top$,

$D_2 = \{3, 3, 2, 3, 2, 3, 3, 2, 3, 3\} \;\Rightarrow\; \hat{P}_2 = [0.0 \;\; 0.3 \;\; 0.7]^\top$,

$D_3 = \{1, 1, 3, 1, 3, 3, 1, 3, 3, 1\} \;\Rightarrow\; \hat{P}_3 = [0.5 \;\; 0.0 \;\; 0.5]^\top$.
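The column estimates in this example are just normalized counts, as the short sketch below shows. The helper name `empirical_column` is hypothetical; the three samples are the ones listed above, with states labeled 1 to 3.

```python
from collections import Counter


def empirical_column(sample, num_states):
    """Empirical distribution of a sample over states 1..num_states."""
    counts = Counter(sample)
    return [counts[s] / len(sample) for s in range(1, num_states + 1)]


D1 = [2, 1, 2, 2, 2, 2, 2, 2, 2, 2]
D2 = [3, 3, 2, 3, 2, 3, 3, 2, 3, 3]
D3 = [1, 1, 3, 1, 3, 3, 1, 3, 3, 1]

# One estimated column of P per initial state.
P_hat_columns = [empirical_column(d, 3) for d in (D1, D2, D3)]
```

Because only counts enter the estimate, shuffling the observations inside each set leaves the result unchanged, which is the order-invariance noted in the text.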

A nice property of these estimates is that, unlike in the sequential setting, they do not depend on any particular ordering of the observations in each set. Nevertheless, such data are not quite non-sequenced because all observations are made at exactly the next time step. We thus consider the following generalization: for each state $j$, we have $D_j := \{x_1^{(t_1)}, x_2^{(t_2)}, \ldots \mid x^{(0)} = j\}$, i.e., independent samples of states drawn at unknown future times $\{t_1, t_2, \ldots\}$. For example, our data in this setting might be

$D_1 = \{2, 1, 2, 3, 2, 3, 3, 2, 2, 3\}$,

$D_2 = \{3, 3, 2, 3, 2, 1, 3, 2, 3, 1\}$,    (5.3)

$D_3 = \{1, 1, 3, 1, 2, 3, 2, 3, 3, 2\}$.

Obviously it is hard to extract information about $P$ from such data. However, if we assume that the unknown times $\{t_i\}$ are i.i.d. random variables following some distribution independent of the initial state $j$, it can then be easily shown that $D_j$'s empirical distribution consistently estimates $T_j$, the $j$-th column of the expected transition probability matrix $T := \mathbb{E}_t[P^t]$:

$D_1 = \{2, 1, 2, 3, 2, 3, 3, 2, 2, 3\} \;\Rightarrow\; \hat{T}_1 = [0.1 \;\; 0.5 \;\; 0.4]^\top$,

$D_2 = \{3, 3, 2, 3, 2, 1, 3, 2, 3, 1\} \;\Rightarrow\; \hat{T}_2 = [0.2 \;\; 0.3 \;\; 0.5]^\top$,

$D_3 = \{1, 1, 3, 1, 2, 3, 2, 3, 3, 2\} \;\Rightarrow\; \hat{T}_3 = [0.3 \;\; 0.3 \;\; 0.4]^\top$.

In general there exist many $P$'s that result in the same $T$. Therefore, as detailed later, we make
a specific distributional assumption on $\{t_i\}$ to enable unique recovery of the transition matrix $P$
from $T$ (Assumption A.1). Next we consider a further generalization, where the unknowns are
not only the time stamps of the observations, but also the initial state $j$. In other words, we only
know each set was generated from the same initial state, but do not know the actual initial state.
In this case, the empirical distributions of the sets consistently estimate the columns of $T$ in some
unknown permutation $\Pi$:

\[
\begin{aligned}
D_{\Pi(3)} &= \{1,1,3,1,2,3,2,3,3,2\} \;\Rightarrow\; \hat{T}_{\Pi(3)} = [0.3\ \ 0.3\ \ 0.4]^\top, \\
D_{\Pi(2)} &= \{3,3,2,3,2,1,3,2,3,1\} \;\Rightarrow\; \hat{T}_{\Pi(2)} = [0.2\ \ 0.3\ \ 0.5]^\top, \\
D_{\Pi(1)} &= \{2,1,2,3,2,3,3,2,2,3\} \;\Rightarrow\; \hat{T}_{\Pi(1)} = [0.1\ \ 0.5\ \ 0.4]^\top.
\end{aligned}
\]

In order to be able to identify $\Pi$, we will again resort to randomness and assume the unknown
initial states are random variables following a certain distribution (Assumption A.2) so that the
data carry information about $\Pi$. Finally, we generalize from a single unknown initial state to an
unknown initial state distribution, where each set of observations $D := \{x_1^{(t_1)}, x_2^{(t_2)}, \ldots \mid \pi^{(0)}\}$
consists of independent samples of states drawn at random times from some unknown initial state
distribution $\pi^{(0)}$. For example, the data may look like:

\[
\begin{aligned}
D_{\pi^{(0)}_1} &= \{1,3,3,1,2,3,2,3,3,2\}, \\
D_{\pi^{(0)}_2} &= \{3,1,2,3,2,1,3,2,3,1\}, \\
D_{\pi^{(0)}_3} &= \{2,1,2,3,3,3,3,1,2,3\}.
\end{aligned}
\]


With this final generalization, most would agree that the generated data are non-sequenced and
that the generative process is flexible enough to model some of the real-world situations
described in Chapter 1. However, simple estimation with empirical distributions no longer works
because each set may now contain observations from multiple initial states. This is where we
take advantage of the tensor decomposition framework outlined in Section 5.1, which requires
proper assumptions on the initial state distribution $\pi^{(0)}$ (Assumption A.3).

More formally, the aforementioned ideas motivate the following three assumptions:

•  Assumption A.1. The missing times $\{t_i\}$ are i.i.d. according to a Geometric distribution.
This makes it possible to uniquely recover the transition matrix $P$ from the expected
transition matrix $T$.

•  Assumption A.2. The stationary distribution $\pi$ of the Markov chain is such that
$\pi_i \neq \pi_j\ \forall\, i \neq j$. This allows recovering the correct column permutation of $T$.

•  Assumption A.3. The initial state distribution $\pi^{(0)}$ is a random quantity following a
Dirichlet distribution, and $\mathbb{E}[\pi^{(0)}] = \pi$, the stationary distribution. This allows the use
of tensor decomposition methods.
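
To make the role of Assumption A.1 concrete, the following short derivation (a sketch, assuming the geometric times take values in $\{1, 2, 3, \ldots\}$ with success probability $r \in (0, 1]$, and using that $P$ commutes with any function of itself) shows that $T$ has a closed form and that, for a known $r$, the map from $P$ to $T$ can be inverted:

\[
\begin{aligned}
T := \mathbb{E}_t[P^t] &= \sum_{t=1}^{\infty} r(1-r)^{t-1} P^t
  = rP \sum_{s=0}^{\infty} \bigl((1-r)P\bigr)^s
  = rP\bigl(I - (1-r)P\bigr)^{-1}, \\
rI + (1-r)T &= r\bigl[(I - (1-r)P) + (1-r)P\bigr]\bigl(I - (1-r)P\bigr)^{-1}
  = r\bigl(I - (1-r)P\bigr)^{-1}, \\
\bigl(rI + (1-r)T\bigr)^{-1} T &= \tfrac{1}{r}\bigl(I - (1-r)P\bigr)\, rP \bigl(I - (1-r)P\bigr)^{-1} = P.
\end{aligned}
\]

The series converges because the spectral radius of $(1-r)P$ is $1 - r < 1$ for a stochastic $P$.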

Now we are ready to give the definition of our entire generative process. Assume we have $N$ sets
of non-sequence data each containing $n$ observations, and each set of observations $\{x_i\}_{i=1}^{n}$ was
independently generated by the following:

•  Draw an initial distribution

$\pi^{(0)} \sim \mathrm{Dirichlet}(\alpha)$,  (Assumption A.3)

$\mathbb{E}[\pi^{(0)}] = \alpha / (\sum_{i=1}^{m} \alpha_i) = \pi$,  $\pi_i \neq \pi_j\ \forall\, i \neq j$.  (Assumption A.2)

•  For $i = 1, \ldots, n$,

■  Draw a discrete time

$t_i \sim \mathrm{Geometric}(r)$,  $t_i \in \{1, 2, 3, \ldots\}$.  (Assumption A.1)

■  Draw an initial state

$s_i \sim \mathrm{Multinomial}(\pi^{(0)})$,  $s_i \in \{0, 1\}^m$.

■  Draw an observation

$x_i \sim \mathrm{Multinomial}(P^{t_i} s_i)$,  $x_i \in \{0, 1\}^m$.
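
The generative process above can be simulated directly. The sketch below is illustrative only: the function name, the integer (rather than one-hot) encoding of states, and the example values of $P$, $\alpha_0$ and $r$ are assumptions made here, with the example $P$ borrowed from the column estimates in the earlier three-state illustration.

```python
import numpy as np

def sample_sets(P, alpha, r, N, n, seed=0):
    """Draw N sets of n non-sequence observations (states coded 0..m-1)."""
    rng = np.random.default_rng(seed)
    m = P.shape[0]
    data = np.empty((N, n), dtype=int)
    for j in range(N):
        pi0 = rng.dirichlet(alpha)                 # set-specific initial distribution
        for i in range(n):
            t = int(rng.geometric(r))              # time step in {1, 2, 3, ...}
            s0 = rng.choice(m, p=pi0)              # initial state
            p_t = np.linalg.matrix_power(P, t)[:, s0]   # column s0 of P^t
            data[j, i] = rng.choice(m, p=p_t / p_t.sum())  # renormalize against round-off
    return data

# Illustrative column-stochastic P and its stationary distribution.
P = np.array([[0.1, 0.0, 0.5],
              [0.9, 0.3, 0.0],
              [0.0, 0.7, 0.5]])
pi = np.array([35.0, 45.0, 63.0]) / 143.0
alpha0 = 1.0                                        # assumed Dirichlet concentration
data = sample_sets(P, alpha=alpha0 * pi, r=0.3, N=5, n=10)
```

Setting `alpha = alpha0 * pi` makes the Dirichlet mean equal to the stationary distribution, as Assumption A.3 requires, and the entries of this `pi` are distinct, as Assumption A.2 requires.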

As mentioned earlier, our generative process captures some characteristics of real-world
situations. First, all the data points in the same set share the same initial state distribution but can
have different initial states; the initial state distribution varies across different sets and yet centers
around the stationary distribution of the Markov chain. As mentioned in the beginning of this
chapter, this may be achieved in biological studies by running multiple experiments with different
input stimuli, so the data collected in the same experiment can be assumed to have the same
initial state distribution. Second, each data point is drawn from an independent trajectory of the
Markov chain, a situation similar to that in the modeling of galaxies or Alzheimer's disease, and
random time steps could be used to compensate for individual variations in speed: a small/large
$t_i$ corresponds to a slowly/rapidly evolving individual object. Finally, the geometric distribution
can be interpreted as an overall measure of the magnitude of speed variation: a large success
probability $r$ would result in many small $t_i$'s, meaning that most objects evolve at similar speeds,
while a small $r$ would lead to $t_i$'s taking a wide range of values, indicating a large speed variation.

Figure 5.2: Graphical models of the data generative process for first-order Markov models

Figure 5.2(a) shows the graphical model of our generative process. Interestingly, this graphical
model is very similar to the widely-used topic model Latent Dirichlet Allocation (LDA)
[Blei et al., 2003]. In fact, by summing out the random time $t$, we obtain a graphical model that
depends only on the expected transition probability matrix $T$ and, as shown in Figure 5.2(b),
has exactly the same structure as LDA. More specifically, we can view a set of non-sequence
data points as a document generated by an LDA model, where each $x_i$ corresponds to a word,
$s_i$ to a topic, the expected transition matrix $T$ to the word-topic matrix, the initial distribution
$\pi^{(0)}$ to the topic distribution of the document, and the stationary distribution $\pi$ to the overall
topic proportions. Such a structural equivalence to LDA allows us to take advantage of recent
advances in spectral learning [Anandkumar et al., 2012a] with rigorous guarantees on parameter
estimation. However, our generative process has a critically distinct property: the words are the
topics, both of which correspond to states. As a result, unlike most topic models, our model is not
invariant to column permutations of the word-topic matrix. We thus need Assumption A.2 to be
able to recover the correct permutation.

Now we describe our spectral learning algorithm, which consists of three main steps:

1. Compute certain low-order moments of the data;

2. Perform tensor decomposition of the empirical moments;

3. Recover model parameters from the factors given by tensor decomposition.

The high-level idea is that according to our generative process, certain low-order moments of the
data have the tensor structure (5.2) with the factors being the Markov model parameters, so we
can use the tensor decomposition method in Section 5.1 to extract the model parameters from the
moments. The following theorem gives the desired low-order moments and their structure:

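Theorem 2. Let $\alpha_0 := \sum_i \alpha_i$. Define the expected transition probability matrix
$T := \mathbb{E}_t[P^t] = rP(I - (1-r)P)^{-1}$ and

\[
\begin{aligned}
C_2 &:= \mathbb{E}[x_1 \otimes x_2], \\
C_3 &:= \mathbb{E}[x_1 \otimes x_2 \otimes x_3], \\
M_2 &:= (\alpha_0 + 1)\, C_2 - \alpha_0\, \mathbb{E}[x_1] \otimes \mathbb{E}[x_1], \\
M_3 &:= \frac{(\alpha_0+2)(\alpha_0+1)}{2}\, C_3 - \frac{(\alpha_0+1)\alpha_0}{2} \sum_{d=1}^{3} \mathbb{E}[x_1] \otimes_d C_2 + \alpha_0^2\, \mathbb{E}[x_1]^{\otimes 3}.
\end{aligned}
\]

Then the following holds:

\[
\begin{aligned}
\mathbb{E}[x_1] &= \pi, \\
C_2 &= \frac{1}{\alpha_0+1}\, T \operatorname{diag}(\pi)\, T^\top + \frac{\alpha_0}{\alpha_0+1}\, \pi \otimes \pi, \\
C_3 &= \frac{2}{(\alpha_0+2)(\alpha_0+1)} \sum_i \pi_i T_i^{\otimes 3} + \frac{\alpha_0}{\alpha_0+2} \sum_{d=1}^{3} \pi \otimes_d C_2 - \frac{2\alpha_0^2}{(\alpha_0+2)(\alpha_0+1)}\, \pi^{\otimes 3}, \\
M_2 &= T \operatorname{diag}(\pi)\, T^\top, \\
M_3 &= \sum_i \pi_i T_i^{\otimes 3}.
\end{aligned}
\]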

We call $M_2$ and $M_3$ the adjusted moments because they are computed from the raw moments
$\mathbb{E}[x_1]$, $C_2$ and $C_3$. Because of the connection of our generative process to LDA, the proof of
this theorem, given in Appendix B.1.1, mainly uses existing results in spectral learning of LDA
[Anandkumar et al., 2013], which rely on the special structure in the moments of the Dirichlet
distribution (Assumption A.3). According to Theorem 2, it is clear that the adjusted moments $M_2$
and $M_3$ have the desired tensor structure (5.2). Assuming $\alpha_0$ is known, we can form estimates $\hat{M}_2$
and $\hat{M}_3$ by computing empirical moments from the data. Note that the $x_i$'s are exchangeable,
so we can use all pairs and triples of data points to compute the estimates. Since the tensor
decomposition method may return $T$ under any column permutation, we need to recover the
correct matching between its rows and columns. To do so, we note that the $\pi$ returned by the
tensor decomposition method undergoes the same permutation as the columns of $T$, and because
all $\pi_i$'s have different values by Assumption A.2, we may recover the correct matching by sorting
both the returned $\pi$ and the empirical mean $\hat{\pi}$ of all data.
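
As a rough sketch of how the second-order estimate might be formed from all pairs of points in each set (the function name and the integer coding of states are assumptions made here; the third-order estimate, which averages over triples in the same manner, is omitted for brevity):

```python
import numpy as np

def adjusted_second_moment(data, m, alpha0):
    """data: (N, n) integer array of states coded 0..m-1, one row per set."""
    N, n = data.shape
    pi_hat = np.bincount(data.ravel(), minlength=m) / data.size
    C2_hat = np.zeros((m, m))
    for row in data:
        s = np.bincount(row, minlength=m).astype(float)   # state counts in this set
        # For one-hot x_i, the sum of x_i x_j^T over ordered pairs i != j
        # equals s s^T - diag(s); divide by the number of such pairs.
        C2_hat += (np.outer(s, s) - np.diag(s)) / (n * (n - 1))
    C2_hat /= N
    M2_hat = (alpha0 + 1.0) * C2_hat - alpha0 * np.outer(pi_hat, pi_hat)
    return pi_hat, C2_hat, M2_hat

# Tiny hand-checkable example: one set containing states 0 and 1.
pi_hat, C2_hat, M2_hat = adjusted_second_moment(np.array([[0, 1]]), m=2, alpha0=1.0)
```

Using the count vector avoids an explicit loop over pairs, so the cost per set is quadratic in the number of states rather than in the set size.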

The  last  issue  is  recovering  P  from  T,  for  which  we  make  the  distributional  assumption 
A.1 on the random times $\{t_i\}$. With such an assumption, we have reduced the search space
from  all  possible  mappings  between  T  and  P  to  one  single  parameter,  the  success  probability  r. 
Nevertheless,  recovering  P  and  r  is  in  general  still  difficult  even  when  the  exact  T  is  available, 
because  multiple  choices  of  P  and  r  may  result  in  the  same  T.  In  practical  situations,  however, 
we  can  often  assume  the  underlying  transition  probability  matrix  P  has  some  zero  entries,  e.g., 
when  the  true  Markov  chain  is  based  on  a  graph,  or  when  the  state  transition  is  under  some 
external  or  physical  constraint.  With  this  extra  assumption,  we  prove  that  unique  recovery  is 
possible  in  the  population  case: 
Theorem 3. Let $P^*$, $r^*$, $T^*$ and $\pi^*$ denote the true values of the transition probability matrix, the
success probability, the expected transition matrix, and the stationary distribution, respectively.
Assume that $P^*$ is ergodic and of full rank, and $P^*_{ij} = 0$ for some arbitrary $i$ and $j$. Let
$\mathcal{S} := \{\lambda / (\lambda - 1) \mid \lambda \text{ is a real negative eigenvalue of } T^*\} \cup \{0\}$. Then the following holds:

•  $0 \le \max(\mathcal{S}) < r^* \le 1$.

•  For all $r \in (0, 1] \setminus \mathcal{S}$, $P(r) := (rI + (1-r)T^*)^{-1} T^*$ is well-defined and satisfies

■  $\mathbf{1}^\top P(r) = \mathbf{1}^\top$,  $P(r)\pi^* = \pi^*$,  $P^* = P(r^*)$.

■  $P(r)_{ij} \ge 0\ \forall\, i, j \iff r \ge r^*$.

That is, $P(r)$ is a stochastic matrix if and only if $r \ge r^*$.

The proof is in Appendix B.2, and the key step is to show that the zero entries in $P^*$ become
negative when $r < r^*$. According to this theorem, binary search on $(0, 1]$ suffices to recover $r^*$
and $P^*$ from $T^*$. However, it may fail when we replace $T^*$ by an estimate $\hat{T}$ because even $\hat{P}(r^*)$
might contain negative values. A more practical estimation procedure is the following: for each
value of $r$ in a decreasing sequence starting from 1, we project $\hat{P}(r) := (rI + (1-r)\hat{T})^{-1}\hat{T}$
onto the space of stochastic matrices and record the projection distance. Then starting from
1, we search in the sequence of projection distances for the first sudden increase, and take the
corresponding value of $r$ and (projected) $\hat{P}(r)$ as the final estimates. The idea is that $\hat{P}(r)$
should be close to the space of stochastic matrices when $r \ge r^*$, but starts to move away by
having negative entries as $r$ gets smaller than $r^*$. It is easy to see the estimates produced by such
a procedure converge to $r^*$ and $P^*$ as $\hat{T}$ gets closer to $T^*$ and the discrete search space for $r$
becomes denser. However, a formal convergence rate is yet to be identified. Also, while lacking
a formal proof, we suspect that the more zero entries $P^*$ has, the easier it is to estimate $r^*$ because
$P(r)$ for $r < r^*$ would be further away from the space of stochastic matrices by having more
negative entries. Finally, although sparsity is sufficient for unique recovery of $P^*$ and $r^*$, more
investigation is needed to clarify whether it is also necessary.
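
The search over $r$ can be sketched as follows, here run on the exact (population) $T^*$ of a small illustrative chain. Everything named below is an assumption made for this sketch: the helper names, the example $P^*$ (three zero entries, borrowed from the earlier three-state illustration), the grid, the column-wise Euclidean projection onto the simplex as the notion of projection, and the fixed tolerance `tol` standing in for "the first sudden increase".

```python
import numpy as np

def expected_transition(P, r):
    m = P.shape[0]
    return r * P @ np.linalg.inv(np.eye(m) - (1.0 - r) * P)

def candidate_P(T, r):
    """P(r) = (r I + (1 - r) T)^{-1} T."""
    m = T.shape[0]
    return np.linalg.solve(r * np.eye(m) + (1.0 - r) * T, T)

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def projection_distance(P_r):
    """Frobenius distance from P_r to its column-wise simplex projection."""
    proj = np.column_stack([project_simplex(col) for col in P_r.T])
    return np.linalg.norm(P_r - proj), proj

P_true = np.array([[0.1, 0.0, 0.5],
                   [0.9, 0.3, 0.0],
                   [0.0, 0.7, 0.5]])
r_true = 0.3
T = expected_transition(P_true, r_true)

grid = np.linspace(1.0, 0.05, 20)          # decreasing candidates 1.00, 0.95, ..., 0.05
dists = [projection_distance(candidate_P(T, r))[0] for r in grid]

tol = 1e-6                                  # assumed threshold for a "sudden increase"
first_bad = next(i for i, d in enumerate(dists) if d > tol)
r_hat = grid[first_bad - 1]                 # last grid value before the increase
P_hat = projection_distance(candidate_P(T, r_hat))[1]
```

With an estimated $\hat{T}$ in place of the exact `T`, the distances above the true $r$ would be small rather than numerically zero, so the fixed tolerance would need to be replaced by a data-dependent rule.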

We summarize the entire learning procedure in Algorithm 5.3, which assumes the true $r$ and
$\alpha_0$ are known. Because the empirical moments are consistent estimators for the true moments
and  the  tensor  decomposition  method  returns  accurate  estimates  under  small  input  perturbation, 
we  can  conclude  that  the  estimates  output  by  Algorithm  5.3  will  converge  (with  high  probability) 
to  the  true  quantities  as  the  sample  size  N  increases.  Sample  complexity  bounds  can  be  obtained 
with  techniques  similar  to  those  for  Theorem  5  in  the  next  section. 

Algorithm 5.3 Tensor decomposition method for learning Markov chains from non-sequence data

input $N$ sets of non-sequence data points, the success probability $r$, the Dirichlet parameter $\alpha_0$,
and numbers of iterations $L$ and $N$.
output Estimates $\hat{\pi}$ and $\hat{P}$.

1: Compute empirical averages $\hat{\pi}$, $\hat{C}_2$ and $\hat{C}_3$.

2: Compute $\hat{M}_2$ and $\hat{M}_3$.

3: Run Algorithm 5.1 on $\hat{M}_2$ and $\hat{M}_3$ with target dimension $m$ to obtain a symmetric tensor
$\hat{\mathcal{T}} \in \mathbb{R}^{m \times m \times m}$ and a whitening transformation $\hat{W} \in \mathbb{R}^{m \times m}$.

4: Run Algorithm 5.2 $m$ times each with numbers of iterations $L$ and $N$, the input tensor in
the first run set to $\hat{\mathcal{T}}$ and in each subsequent run set to the deflated tensor returned by the
previous run, resulting in $m$ pairs of eigenvalue/eigenvector $\{(\hat{\lambda}_i, \hat{v}_i)\}_{i=1}^{m}$.

5: Match $\{(\hat{\lambda}_i, \hat{v}_i)\}_{i=1}^{m}$ with observation symbols by sorting $\{\hat{\lambda}_i\}_{i=1}^{m}$ and $\{\hat{\pi}_i^{-1/2}\}_{i=1}^{m}$.

6: Obtain estimate of the transition probability matrix:

\[ \hat{P} := (rI + (1-r)\hat{T})^{-1}\hat{T}, \]

where $\hat{T} := (\hat{W}^\top)^\dagger \hat{V} \hat{\Lambda}$, $\hat{V} := [\hat{v}_1 \cdots \hat{v}_m]$, and $\hat{\Lambda} := \operatorname{diag}([\hat{\lambda}_1 \cdots \hat{\lambda}_m]^\top)$.

7: (Optional) Project $\hat{P}$ onto the space of stochastic matrices.


Remarks on Identifiability. Unlike in Chapter 3, where some properties of the true dynamic
model are not identifiable from non-sequence data, our proposed method here guarantees
consistent parameter estimation. The main reason, as one can imagine, is that non-identifiability
is assumed away in our data generative model. For example, consider the following transition
probability matrix and its transpose:

which are ergodic for $0 < q < 1$. The sequences of observations generated by these two
Markov chains will be in approximately opposite directions of time, and therefore cause
non-identifiability when time information is missing. However, such Markov chains are excluded by
our assumption that the stationary distribution $\pi$ satisfies $\pi_i \neq \pi_j\ \forall\, i \neq j$ because $P\mathbf{1} = \mathbf{1}$. More
generally, all Markov chains with a doubly-stochastic transition probability matrix are excluded
by that assumption because their stationary distributions are the uniform distribution. Another
potentially non-identifiable class of models is the time-reversible Markov chains
[Grimmett and Stirzaker, 2001, Chapter 6.5], where the transition probability $P$ and the stationary
distribution $\pi$ satisfy

\[ \pi_i P_{ji} = \pi_j P_{ij} \quad \forall\, i, j. \]

A well-known result is that the time direction of such a Markov chain cannot be distinguished
from the reverse direction after it fully mixes. According to our generative assumption, each
of the non-sequence data sets contains, with a positive probability, observations made before the
Markov chain fully mixes. Those observations make it possible to eliminate the non-identifiability
of time direction even in the case of time-reversible models.

5.2.2  Hidden  Markov  Models 

Equipped  with  the  intuition  and  strategies  for  learning  first-order  Markov  models,  we  are  now 
ready  to  handle  the  more  complicated  hidden  Markov  models.  As  detailed  later,  it  turns  out  that 
both  the  generative  process  and  the  learning  procedure  for  HMMs  are  quite  similar  to  those  for 
Markov  chains,  with  the  main  distinction  being  two  applications  of  tensor  decomposition,  where 
the  extra  one  is  due  to  the  mapping  from  the  hidden  state  space  to  the  observation  space. 

Figure 5.3: Graphical model of the data generative process for HMMs

Let $P$ and $\pi$ now be defined over the hidden discrete state space of cardinality $k$ and have
the same properties as the first-order Markov model in Section 5.2.1. Again, we assume the
non-sequence data consists of $N$ sets of data points, where each set now contains continuous
observations in $\mathbb{R}^m$. The generative process for each set is almost identical to and therefore
shares the same interpretation with the one for Markov chains, except for an extra mapping from
the discrete hidden state space to a continuous observation space:

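•  Draw an initial hidden state distribution

$\pi^{(0)} \sim \mathrm{Dirichlet}(\alpha)$,

$\mathbb{E}[\pi^{(0)}] = \alpha / (\sum_{i=1}^{k} \alpha_i) = \pi$,  $\pi_i \neq \pi_j\ \forall\, i \neq j$.

•  For $i = 1, \ldots, n$,

■  Draw a discrete time

$t_i \sim \mathrm{Geometric}(r)$,  $t_i \in \{1, 2, 3, \ldots\}$.

■  Draw an initial hidden state

$s_i \sim \mathrm{Multinomial}(\pi^{(0)})$,  $s_i \in \{0, 1\}^k$.

■  Draw a hidden state at time $t_i$

$h_i \sim \mathrm{Multinomial}(P^{t_i} s_i)$,  $h_i \in \{0, 1\}^k$.

■  Draw an observation:

$x_i = U h_i + \epsilon_i$,

where $U \in \mathbb{R}^{m \times k}$ denotes a rank-$k$ matrix of mean observation vectors for the
$k$ hidden states, and the random noise vectors $\epsilon_i$'s are i.i.d. satisfying $\mathbb{E}[\epsilon_i] = 0$,
$\operatorname{Var}[\epsilon_i] = \sigma^2 I$, and $\mathbb{E}[(\epsilon_i)_d^3] = 0$, $1 \le d \le m$.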

A graphical model representation is in Figure 5.3. Compared with the graphical model in Figure
5.2(b), the observation model here is a mixture distribution rather than a discrete state, which
makes learning more complicated, but still manageable. For simplicity we require a common
spherical noise covariance, but our method can be easily modified to allow different spherical
covariances $\sigma^2 I$ for different hidden states (cf. Section 3.2 of [Anandkumar et al., 2012a]). In
Section 5.4 we will discuss possible ways to handle more general noise covariances. Another
important requirement on the observation noise is zero skewness, i.e., zero third-order moment.
As discussed later, we need this condition to ensure that certain moments have the desired tensor
structure. While zero skewness rules out some potentially useful observation models, we
discuss in Section 5.4 how to handle one interesting class of skewed observation noise: discrete
observations.

As  in  Section  5.2.1,  we  develop  our  spectral  learning  algorithm  around  the  tensor  structure 
(5.2)  in  low-order  moments  of  the  data: 
Theorem 4. Let $\alpha_0 := \sum_i \alpha_i$. Define the expected hidden state transition matrix
$T := \mathbb{E}_t[P^t] = rP(I - (1-r)P)^{-1}$ and

\[
\begin{aligned}
V_1 &:= \mathbb{E}[x_1], \\
V_2 &:= \mathbb{E}[x_1 \otimes x_1], \\
V_3 &:= \mathbb{E}[x_1^{\otimes 3}], \\
M_2' &:= V_2 - \sigma^2 I, \\
M_3' &:= V_3 - \sum_{d=1}^{3} V_1 \otimes_d (\sigma^2 I), \\
C_2 &:= \mathbb{E}[x_1 \otimes x_2], \\
C_3 &:= \mathbb{E}[x_1 \otimes x_2 \otimes x_3], \\
M_2 &:= (\alpha_0 + 1)\, C_2 - \alpha_0\, V_1 \otimes V_1, \\
M_3 &:= \frac{(\alpha_0+2)(\alpha_0+1)}{2}\, C_3 - \frac{(\alpha_0+1)\alpha_0}{2} \sum_{d=1}^{3} V_1 \otimes_d C_2 + \alpha_0^2\, V_1^{\otimes 3}.
\end{aligned}
\]

Then the following holds:

\[
\begin{aligned}
V_1 &= U\pi, \\
V_2 &= U \operatorname{diag}(\pi)\, U^\top + \sigma^2 I, \\
V_3 &= \sum_i \pi_i U_i^{\otimes 3} + \sum_{d=1}^{3} V_1 \otimes_d (\sigma^2 I), \\
M_2' &= U \operatorname{diag}(\pi)\, U^\top, \\
M_3' &= \sum_i \pi_i U_i^{\otimes 3}, \\
C_2 &= \frac{1}{\alpha_0+1}\, UT \operatorname{diag}(\pi)\, (UT)^\top + \frac{\alpha_0}{\alpha_0+1}\, V_1 \otimes V_1, \\
C_3 &= \frac{2}{(\alpha_0+2)(\alpha_0+1)} \sum_i \pi_i (UT)_i^{\otimes 3} + \frac{\alpha_0}{\alpha_0+2} \sum_{d=1}^{3} V_1 \otimes_d C_2 - \frac{2\alpha_0^2}{(\alpha_0+2)(\alpha_0+1)}\, V_1^{\otimes 3}, \\
M_2 &= UT \operatorname{diag}(\pi)\, (UT)^\top, \\
M_3 &= \sum_i \pi_i (UT)_i^{\otimes 3}.
\end{aligned}
\]

The proof is in Appendix B.1.2. This theorem suggests that HMMs require two applications
of the tensor decomposition method: one on the adjusted cross moments $M_2$ and $M_3$, as in
learning Markov chains, for extracting the matrix product $UT$, and the other on the adjusted
covariance $M_2'$ and tri-variance $M_3'$ for extracting the mean observation vectors $U$. Just as the
low-order moments of first-order Markov models in Theorem 2 are similar to those of LDA,
the tensor structures here also have connections to other latent variable models. First, $M_2'$ and
$M_3'$ are reminiscent of mixtures of spherical Gaussians. Indeed, each set of observations can
be viewed as independent samples drawn from a mixture model where each mixture component
is a distribution with a spherical covariance and the mixture weights are $T\pi^{(0)}$, $\pi^{(0)}$ denoting
the initial hidden state distribution of that set. Therefore, when forming estimates for $M_2'$ and
$M_3'$, which require an estimate for the noise variance $\sigma^2$, we may use the existing result for
spherical Gaussians (Theorem 3.2 in [Anandkumar et al., 2012a]) to obtain an estimate
$\hat{\sigma}^2 = \lambda_{\min}(\hat{V}_2 - \hat{V}_1 \hat{V}_1^\top)$. Also worth noting is that the zero skewness condition on the observation noise,
as detailed in Appendix B.1.2, is needed so that $M_3'$ has the desired tensor structure. Second, $M_2$
and $M_3$ can be viewed as cross moments of a topic model with continuous observations, i.e.,
a "word" is a real vector drawn from a topic-specific continuous distribution, whose mean is a
column of $UT$. Interestingly, the proof of Theorem 4 in Appendix B.1.2 indicates that $M_2$ and
$M_3$ always have the same form regardless of the observation noise model. This property, as
discussed later in Section 5.4, allows the possibility of handling more general noise covariances.

Algorithm 5.4 Tensor decomposition method for learning HMM from non-sequence data

input $N$ sets of non-sequence data points, the success probability $r$, the Dirichlet parameter $\alpha_0$,
the number of hidden states $k$, and numbers of iterations $L$ and $N$.
output Estimates $\hat{\pi}$, $\hat{P}$ and $\hat{U}$ possibly under permutation of state labels.

1: Compute empirical averages $\hat{V}_1$, $\hat{V}_2$, $\hat{V}_3$, $\hat{C}_2$, $\hat{C}_3$, and $\hat{\sigma}^2 := \lambda_{\min}(\hat{V}_2 - \hat{V}_1 \hat{V}_1^\top)$.

2: Compute $\hat{M}_2$, $\hat{M}_3$, $\hat{M}_2'$, $\hat{M}_3'$.

3: Run Algorithm 5.1 on $\hat{M}_2$ and $\hat{M}_3$ with the number of hidden states $k$ to obtain a symmetric
tensor $\hat{\mathcal{T}} \in \mathbb{R}^{k \times k \times k}$ and a whitening transformation $\hat{W} \in \mathbb{R}^{m \times k}$.

4: Run Algorithm 5.2 $k$ times each with numbers of iterations $L$ and $N$, the input tensor in
the first run set to $\hat{\mathcal{T}}$ and in each subsequent run set to the deflated tensor returned by the
previous run, resulting in $k$ pairs of eigenvalue/eigenvector $\{(\hat{\lambda}_i, \hat{v}_i)\}_{i=1}^{k}$.

5: Repeat Steps 3 and 4 on $\hat{M}_2'$ and $\hat{M}_3'$ to obtain $\hat{\mathcal{T}}'$, $\hat{W}'$ and $\{(\hat{\lambda}_i', \hat{v}_i')\}_{i=1}^{k}$.

6: Match $\{(\hat{\lambda}_i, \hat{v}_i)\}_{i=1}^{k}$ with $\{(\hat{\lambda}_i', \hat{v}_i')\}_{i=1}^{k}$ by sorting $\{\hat{\lambda}_i\}_{i=1}^{k}$ and $\{\hat{\lambda}_i'\}_{i=1}^{k}$.

7: Obtain estimates of HMM parameters:

\[
\begin{aligned}
\widehat{UT} &:= (\hat{W}^\top)^\dagger \hat{V} \hat{\Lambda}, &
\hat{U} &:= (\hat{W}'^\top)^\dagger \hat{V}' \hat{\Lambda}', \\
\hat{P} &:= (r\hat{U} + (1-r)\widehat{UT})^\dagger\, \widehat{UT}, &
\hat{\pi} &:= [\hat{\lambda}_1^{-2} \cdots \hat{\lambda}_k^{-2}]^\top,
\end{aligned}
\]

where $\hat{V} := [\hat{v}_1 \cdots \hat{v}_k]$, $\hat{\Lambda} := \operatorname{diag}([\hat{\lambda}_1 \cdots \hat{\lambda}_k]^\top)$; $\hat{V}'$ and $\hat{\Lambda}'$ are defined in the same way.

8: (Optional) Project $\hat{\pi}$ onto the simplex and $\hat{P}$ onto the space of stochastic matrices.
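
The noise-variance estimate $\lambda_{\min}(V_2 - V_1 V_1^\top)$ used above can be sanity-checked on population moments. The sketch below assumes $m > k$ and uses made-up values for $U$, $\pi$ and $\sigma^2$; the function names are hypothetical. In this setting $U(\operatorname{diag}(\pi) - \pi\pi^\top)U^\top$ is positive semidefinite with rank at most $k - 1$, so the smallest eigenvalue of $V_2 - V_1 V_1^\top$ equals $\sigma^2$.

```python
import numpy as np

def noise_variance_from_moments(V1, V2):
    """Smallest eigenvalue of V2 - V1 V1^T."""
    return float(np.linalg.eigvalsh(V2 - np.outer(V1, V1))[0])

def noise_variance_from_data(X):
    """X: (num_points, m) array pooling the observations of all sets."""
    V1 = X.mean(axis=0)
    V2 = X.T @ X / X.shape[0]
    return noise_variance_from_moments(V1, V2)

# Population check with illustrative parameters (m = 6 > k = 3).
rng = np.random.default_rng(0)
m, k, sigma2 = 6, 3, 2.0
U = rng.standard_normal((m, k))
pi = np.array([0.2, 0.3, 0.5])
V1 = U @ pi
V2 = U @ np.diag(pi) @ U.T + sigma2 * np.eye(m)
sigma2_hat = noise_variance_from_moments(V1, V2)
```

`numpy.linalg.eigvalsh` returns eigenvalues in ascending order, so index 0 is the smallest one.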

As in learning Markov chains, we need to resolve issues related to permutation invariance
inherent in tensor decomposition. The situation is a bit more complicated here. First note that
$P = (rU + (1-r)UT)^\dagger UT$, which implies that permuting the columns of $U$ and the columns
of $UT$ in the same manner has the effect of permuting both the rows and the columns of $P$,
essentially re-labeling the hidden states. Hence we can only expect to recover $P$ up to some
simultaneous row and column permutation. By the assumption that the $\pi_i$'s are all different, we can
sort the two estimates $\hat{\pi}'$ and $\hat{\pi}$ to match the columns of $U$ and $UT$, and obtain $P$ if $r$ is known.
When $r$ is unknown, a similar heuristic to the one for first-order Markov models can be used to
estimate $r$, based on the fact that $P = (rU + (1-r)UT)^\dagger UT = (rI + (1-r)T)^{-1}T$, meaning
Theorem 3 still holds when expressing $P$ by $U$ and $UT$.
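
The matching-by-sorting step and the pseudo-inverse formula for $P$ can be illustrated on exact factors. In this sketch the two "decompositions" are simulated by permuting the columns of the true $U$ and $UT$ differently; the function name, the example $P$, and the random $U$ are assumptions made for illustration.

```python
import numpy as np

def recover_P_from_factors(U_cols, pi_from_U, UT_cols, pi_from_UT, r):
    """Align two decompositions by sorting their pi estimates, then solve for P."""
    U_sorted = U_cols[:, np.argsort(pi_from_U)]
    UT_sorted = UT_cols[:, np.argsort(pi_from_UT)]
    A = r * U_sorted + (1.0 - r) * UT_sorted
    return np.linalg.pinv(A) @ UT_sorted

# Illustrative hidden chain with distinct stationary probabilities (sorted ascending).
P = np.array([[0.1, 0.0, 0.5],
              [0.9, 0.3, 0.0],
              [0.0, 0.7, 0.5]])
pi = np.array([35.0, 45.0, 63.0]) / 143.0
r = 0.3
T = r * P @ np.linalg.inv(np.eye(3) - (1.0 - r) * P)

rng = np.random.default_rng(1)
U = rng.standard_normal((5, 3))            # full column rank with probability one
perm_a, perm_b = [2, 0, 1], [1, 2, 0]      # unknown orderings of the two outputs
P_hat = recover_P_from_factors(U[:, perm_a], pi[perm_a],
                               (U @ T)[:, perm_b], pi[perm_b], r)
```

Because `pi` is already in ascending order here, sorting restores the original labeling and `P_hat` matches `P`; in general the result agrees with `P` only up to a simultaneous row and column permutation.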
Algorithm 5.4 gives the complete procedure for learning HMM from non-sequence data.
Combining the perturbation bound of the tensor decomposition method in Theorem 1, perturbation
theory on the whitening procedure (Appendix B.3.1) and the matrix pseudo inverse [Stewart,
1977], and concentration bounds on empirical moments (Appendix 7), we provide a sample
complexity analysis of the proposed algorithm:

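Theorem 5. Suppose the numbers of iterations $N$ and $L$ for the tensor decomposition methods
satisfy the conditions in Theorem 1, and the number of hidden states $k$, the success probability $r$,
and the Dirichlet parameter $\alpha_0$ are known. For any $\eta \in (0, 1)$ and $\epsilon > 0$, if the number of sets

\[
N \ge \frac{12 \max(k^2, m)\, m^3 \nu^3 (\alpha_0+2)^2 (\alpha_0+1)^2}{\eta}
\max\!\left(
\frac{225000}{\delta_{\min}^2},\;
\frac{4600}{\min(\sigma_k(M_2'), \sigma_k(M_2))^2},\;
\frac{42000\, c^2\, \sigma_1(UT)^2 \max(\sigma_1(UT), \sigma_1(U), 1)^2}{\epsilon^2\, \sigma_k(rU + (1-r)UT)^4 \min(\sigma_k(UT), \sigma_k(U), 1)^4}
\right),
\]

where $c$ is some constant, $\nu := \max(\sigma^2 + \max_{i,j}(|U_{ij}|^2), 1)$, $\delta_{\min} := \min_{i \neq j} |1/\sqrt{\pi_i} - 1/\sqrt{\pi_j}|$,
and $\sigma_i(\cdot)$ denotes the $i$-th largest singular value, then there exists a permutation matrix $\Pi$ such
that the $\hat{P}$ and $\hat{U}$ returned by Algorithm 5.4 satisfy

\[
\operatorname{Prob}\bigl(\|\hat{P} - \Pi^\top P\, \Pi\| \le \epsilon\bigr) \ge 1 - \eta
\quad\text{and}\quad
\operatorname{Prob}\!\left(\|\hat{U} - U\Pi\| \le \frac{\epsilon\, \sigma_k(rU + (1-r)UT)^2}{6\, \sigma_1(UT)}\right) \ge 1 - \eta,
\]

where $\|\cdot\|$ denotes the matrix spectral norm.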

The proof is in Appendix B.4. In this result, the sample size $N$ exhibits a fairly high-order
polynomial dependency on $m$, $k$, $\epsilon^{-1}$ and scales linearly with the inverse of the failure probability
$1/\eta$, rather than logarithmically as is common in sample complexity results on spectral learning
[Anandkumar et al., 2012a,b]. This is mainly because we do not impose boundedness or
sub-Gaussianity constraints on the observation model, and only use the weaker Markov inequality
for bounding the deviation in the empirical moments. Note that simply assuming the
state-conditioned observation noise to be sub-Gaussian does not enable the use of stronger bounds
such as Hoeffding bounds, because a mixture of sub-Gaussian distributions may not be
sub-Gaussian. One possible strategy to strengthen our result is applying the analysis techniques of
Hsu and Kakade [2013], who demonstrate that the sample complexity of spectral learning of certain
mixture models has a logarithmic dependency on $1/\eta$. However, efforts beyond a direct use
of their results are likely needed due to our LDA-like moments, $M_2$ and $M_3$. Also worth noting
is that $\delta_{\min}^{-2}$ acts as a threshold. As shown in our proof, as long as the operator norm of the tensor
perturbation is sufficiently smaller than $\delta_{\min}$, which measures the gaps between different $\pi_i$'s, we
can correctly match the two sets of estimated tensor eigenvalues. Lastly, the lower bound of $N$,
as one would expect, depends on conditions of the matrices being estimated as reflected in the
various ratios of singular values.

An interesting quantity missing from the sample analysis is the size of each set $n$. To simplify
the analysis we essentially assume¹ $n = 3$, but understanding how $n$ might affect the sample
complexity may have a critical impact in practice: given a fixed budget on the total number of
observations that can be made, should we collect more sets or larger sets? What quantities may
this choice depend on? We do not have a formal result yet, but intuitively, more sets seem to be
always as good as, if not better than, larger sets. According to our generative process, a larger
set provides more information only about a subset of $T$'s columns, those corresponding to the
hidden states on which the set-specific initial state distribution $\pi^{(0)}$ has large probability mass,
whereas more sets provide more information about the entire model. A rigorous analysis is an
interesting direction for future work.

¹ To be rigorous, we assume $n$ to be the smallest number so that we can compute from a single set all the various
empirical moments with non-overlapping data points.


5.3  Simulation 


We consider learning HMMs from non-sequence data produced by the assumed generative
process. The true HMM has $m = 40$, $k = 5$ and spherical Gaussian noise with $\sigma^2 = 2$. The entries
of the mean vectors $U$ were sampled independently from a univariate standard normal, and each
mean vector was then normalized to lie on the unit sphere. The transition matrix $P$ and the
stationary hidden state distribution $\pi$ are

The transition probability matrix has exactly one zero entry. We conduct two experiments.

The first experiment is a sanity check on the consistency of the proposed algorithm. We set
$\alpha_0 = 1$ and r = 0.3 in the generative process, and consider different numbers of sets
$N \in 1000 \times \{2^0, 2^1, \ldots, 2^{10}\}$, while fixing the size of each set n = 1000. The numbers of iterations
for the tensor decomposition method were N = 200 and L = 1000 (iteration counts in the notation of
that method, distinct from the number of sets N). Figure 5.4(a) plots the
relative matrix estimation error (in spectral norm) against the sample size N for P, U, and UT,
showing that U is the easiest to learn, followed by UT, and P is the most difficult, and that all
three errors converge to a very small value for sufficiently large N. Note that in Theorem 5 the
bounds for P and U are different. With the model used here, the extra multiplicative factor in
the bound for U is less than 0.007, suggesting that U is indeed easier to estimate than P. Figure
5.4(b) demonstrates the heuristic for determining r, showing projection distances (in logarithm)
versus r. As N increases, the take-off point gets closer to the true r = 0.3. The large peak
indicates a pole (the set S in Theorem 3).
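
The error measure plotted in Figure 5.4(a) can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function name is hypothetical, and it assumes the columns of the estimate have already been matched to those of the true matrix.

```python
import numpy as np

def relative_spectral_error(estimate, truth):
    """Return ||estimate - truth||_2 / ||truth||_2, using the spectral norm."""
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # ord=2 on a 2-D array is the largest singular value (spectral norm).
    return np.linalg.norm(estimate - truth, ord=2) / np.linalg.norm(truth, ord=2)
```

The same function applies to P, U, and the product UT once estimates of each are available.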

The  second  experiment  compares  the  proposed  method  with  the  popular  EM-based  learning 
paradigm.  In  Appendix  A  we  derive  a  variational  EM  algorithm  for  learning  HMM  parameters 
assuming  the  generative  process  in  Section  5.2.2.  In  this  experiment,  the  generative  process  has 
the same settings as in the first experiment except the number of sets N, which takes smaller
values {125, 250, 500, 1000, 2000, 4000}. We repeat the experiment 20 times with different random
draws from the generative process. Figure 5.5 gives the relative estimation errors for U (in
spectral norm) and P (in entrywise 1-norm) for three methods: Algorithm 5.4 (tensor), variational
EM initialized with the output of Algorithm 5.4 (tensor+vbEM), and variational EM initialized
with 100 random parameter values (rand+vbEM). Clearly, Algorithm 5.4 outperforms the randomly
initialized variational EM, and there is barely any improvement resulting from combining
the two methods, except when N is very small. In terms of computational efficiency, we observe
that Algorithm 5.4 is orders of magnitude faster than the variational EM algorithm. On our
platform with 48 cores (2.3 GHz each) and 512GB of memory, Algorithm 5.4 takes a couple of
hours to finish all 20 experiments, but the variational EM method takes days.

Figure 5.4: Simulation confirming consistency of the proposed algorithm

Figure 5.5: Comparison between Algorithm 5.4 and EM


5.4  Discussion 

We have demonstrated that under reasonable assumptions, tensor decomposition methods can
provably learn first-order Markov models and hidden Markov models from non-sequence data.
We believe this is the first formal guarantee on learning dynamic models in a non-sequential
setting. There are several possibilities for improving or generalizing our results.

Procedure  for  estimating  r  with  formal  guarantees 

Our current heuristic for estimating r requires a good measure of the sudden increase, or take-off
point, in the curve of projection distance vs. r, which is hard to define because, depending on the
true transition matrix P, the curve may be rather smooth near the true take-off point, as shown in
Figure 5.4(b). We suspect that as P becomes sparser, the curve shows a sharper increase at the
true take-off point, but do not have a concrete result yet. If such results can be established, it is
then possible to develop a change-point-detection-based procedure for estimating r with formal
guarantees.

Other  distributions  for  the  missing  times 

No matter what distribution generates the random time steps, tensor decomposition methods can
always learn the expected transition probability matrix T. Depending on the specific modeling
task, one may replace the geometric distribution with some other distribution, such as Poisson.

HMMs  with  discrete  observations 

With extra assumptions on the state-observation probability matrix, we can modify our proposed
algorithm to guarantee consistent parameter learning in the case of discrete observations. More
precisely, let $O \in [0, 1]^{m \times k}$ denote the state-observation probability matrix, where each column
is the observation probability vector for a hidden state. We first apply our tensor-decomposition-based
method to recover the matrix product OT, and then obtain estimates of O and T with
the non-negative matrix factorization (NMF) algorithm proposed by Arora et al. [2012], which
guarantees consistency under the “anchor word” assumption, requiring that each column of O
has a corresponding row whose only positive entry lies in that column.

Weaker assumption on Dirichlet parameters $\alpha$

So far we have assumed that the normalized Dirichlet parameter vector $\alpha / \sum_i \alpha_i$ is equal to the
stationary hidden state distribution, allowing the columns and rows of the expected transition
probability matrix T to be correctly matched. However, a careful look into the tensor structures
in Theorems 2 and 4 reveals that the weaker conditions
$\alpha_i > \alpha_j \Rightarrow (T\alpha)_i > (T\alpha)_j \ \forall i \neq j$
and $\alpha_i \neq \alpha_j \ \forall i \neq j$ are sufficient for correct matching. To interpret such conditions, we note that
$\alpha$ is proportional to the average initial hidden state distribution, while $T\alpha$ is proportional to the average hidden
state distribution that generates the observations. As long as the hidden states, when sorted by
probability mass, are in the same unique order under these two distributions, we can correctly
match the rows and columns of T.
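
The ordering condition above can be checked numerically. The sketch below is illustrative only: the function name is hypothetical, and it assumes T is stored column-stochastically (column j holds the distribution of the next hidden state given current state j), so that `T @ alpha` is proportional to the average hidden state distribution generating the observations.

```python
import numpy as np

def same_unique_order(alpha, T):
    """Check that alpha has distinct entries and that sorting the hidden
    states by alpha or by T @ alpha gives the same strict order."""
    alpha = np.asarray(alpha, dtype=float)
    t_alpha = np.asarray(T, dtype=float) @ alpha
    # alpha_i != alpha_j for all i != j
    if np.unique(alpha).size != alpha.size:
        return False
    # alpha_i > alpha_j  =>  (T alpha)_i > (T alpha)_j, checked along the
    # ascending order of alpha (transitivity covers all remaining pairs).
    order = np.argsort(alpha)
    return bool(np.all(np.diff(t_alpha[order]) > 0))
```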

General  observation  noise  covariance 

As  pointed  out  in  Section  5.2.2,  it  is  the  tensor  decomposition-based  estimation  of  the  mean 
observation  vectors  U  that  requires  the  assumption  of  a  spherical  noise  covariance,  while  the 
estimation  of  UT  can  always  be  carried  out  via  tensor  decomposition  regardless  of  the  noise 
distribution.  This  property,  together  with  the  fact  that  each  set  of  non-sequence  data  can  be 
viewed  as  independent  samples  drawn  from  a  mixture  model,  suggests  modifications  for  han¬ 
dling  more  general  noise  covariances  by  using  alternative  methods  to  estimate  U.  For  example, 
under  reasonable  assumptions  on  the  separation  between  the  mean  observation  vectors  U,  the 
spectral  projection  based  methods  proposed  by  Achlioptas  and  McSherry  [2005];  Kannan  et  al. 
[2005] are guaranteed to return accurate parameter estimates of mixtures of log-concave
distributions with general noise covariances. In the case of Gaussian mixtures, the approach by
Moitra and Valiant [2010] provably learns the parameters with minimal assumptions that require
no separation between the mean vectors and allow general covariances, though their algorithm,
despite its polynomial time and sample complexity, is far from being practical. Or in the case
of the multi-view Gaussian mixture models considered by Anandkumar et al. [2012b], where
the covariances of different mixture components share the same block diagonal structure, tensor
decomposition based methods can also provably learn the parameters.


Chapter 6

Learning  Hidden  Markov  Models  from 
Sequence  and  Non-sequence  Data 


In this chapter we consider learning HMMs when, in addition to non-sequence data, there are
also some sequence data. The non-sequence data here are assumed to be independent samples
drawn from the stationary distribution of the underlying HMM. Unlike the methods proposed in
the last chapter, which give direct estimates of HMM parameters, our proposed methods here
learn an observable representation of the underlying HMM. In the usual sequence-data-only
setting, spectral learning of observable representations of HMMs [Hsu et al., 2009; Siddiqi et al.,
2010; Song et al., 2010] is becoming an appealing alternative to the popular EM method because
of its formal theoretical guarantees and, more importantly, empirical success in several
applications ranging from robot vision to music analysis [Song et al., 2010]. Building on these recent
advances, we propose spectral methods that combine sequence and non-sequence data for
learning HMMs. Unlike most spectral algorithms, which apply Singular Value Decomposition (SVD)
to moments estimated by empirical averages of data, our methods first solve a penalized least
square problem to get better estimates of moments, and then apply SVD.¹ As one may imagine,
the penalized least square problem here has a similar structure to the one in Chapter 4, where the
objective consists of a squared error function on the sequence data and a regularization term based
on non-sequence data. But somewhat surprisingly, as we will show in detail later, the
optimization problems here turn out to be convex, even though they are dealing with a more
complex model than the VAR model in Chapter 4. Through experiments on synthetic data and real
Inertial Measurement Unit recordings of human activities, we demonstrate that, as with VARs,
incorporating non-sequence data also improves estimation of HMMs.

This chapter is organized as follows. Section 6.1 briefly reviews spectral learning algorithms,
and  Section  6.2  details  the  proposed  algorithms,  followed  by  experiments  and  results  in  Section 
6.3  and  conclusions  in  Section  6.4. 


¹ The general idea of invoking convex optimization in spectral learning has been proposed recently in the
sequential learning setting. Among others, Balle et al. [2012] solve a convex program in place of SVD, while
Balle and Mohri [2012] use convex optimization to obtain input matrices to spectral algorithms.


6.1 Spectral Learning of HMMs

We begin with discrete observations, and mainly follow the exposition by Siddiqi et al. [2010].
Instead of learning the original model parameters, i.e., initial state probabilities, state transition
probabilities, and state-conditioned observation probabilities, the spectral algorithm learns an
observable representation of the HMM, which consists of the following parameters:

$$b_1 := U^\top p, \tag{6.1}$$

$$b_\infty := (P_{2,1}^\top U)^\dagger p, \tag{6.2}$$

$$B_x := (U^\top P_{3,x,1})(U^\top P_{2,1})^\dagger, \quad 1 \le x \le N, \tag{6.3}$$

where $\dagger$ denotes the pseudo-inverse, N is the number of observation symbols, p is the stationary
distribution of observations, and $P_{2,1}$ and $P_{3,x,1}$ are joint observation probability matrices such
that for $1 \le i, x, j \le N$,

$$\begin{aligned}
(P_{2,1})_{ij} &:= \mathrm{Prob}(x_{t+1} = i, x_t = j), \\
(P_{3,x,1})_{ij} &:= \mathrm{Prob}(x_{t+1} = i, x_t = x, x_{t-1} = j),
\end{aligned} \tag{6.4}$$

$x_t$ being the observation symbol at time t, and $U \in \mathbb{R}^{N \times k}$ is the column concatenation of the top
k left singular vectors of $P_{2,1}$. As the name suggests, the observable representation parameters
(6.1) to (6.3) only depend on observable quantities, leading naturally to the estimates $\hat b_1$, $\hat b_\infty$,
and $\hat B_x$ based on the empirical averages $\hat p$, $\hat P_{2,1}$, $\hat P_{3,x,1}$, and $\hat U$, the top-k left singular vectors of $\hat P_{2,1}$.
These estimates allow us to perform inference on a new sequence of observations $y_1, \ldots, y_t$:

• Predict whole sequence probability:

$$\widehat{\mathrm{Prob}}(y_1, \ldots, y_t) := \hat b_\infty^\top \hat B_{y_t} \cdots \hat B_{y_1} \hat b_1. \tag{6.5}$$

• Internal state update: $\hat b_{t+1} := \hat B_{y_t} \hat b_t / (\hat b_\infty^\top \hat B_{y_t} \hat b_t)$.

• Conditional probability of $y_t$ given $y_1, \ldots, y_{t-1}$:

$$\widehat{\mathrm{Prob}}(y_t \mid y_1, \ldots, y_{t-1}) := \frac{\hat b_\infty^\top \hat B_{y_t} \hat b_t}{\sum_x \hat b_\infty^\top \hat B_x \hat b_t}. \tag{6.6}$$

Under some mild conditions, the most critical being that both the state transition and
state-conditioned observation probability matrices are of rank k, Siddiqi et al. [2010] showed
that the whole sequence probability estimate (6.5) is consistent (with high probability) and gave
a finite-sample bound on the estimation error.
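
To make (6.1) to (6.6) concrete, the following sketch computes the observable representation from given moment matrices and uses it for inference. It is an illustration under stated assumptions rather than the implementation of Siddiqi et al. [2010]: the function names are hypothetical, symbols are indexed from 0, and `P3x1[x]` is assumed to hold the matrix $P_{3,x,1}$.

```python
import numpy as np

def observable_representation(p, P21, P3x1, k):
    """Compute b_1, b_inf and {B_x} of (6.1)-(6.3).

    p: (N,) observation distribution; P21: (N, N); P3x1: (N, N, N).
    """
    U = np.linalg.svd(P21)[0][:, :k]            # top-k left singular vectors
    b1 = U.T @ p                                # (6.1)
    binf = np.linalg.pinv(P21.T @ U) @ p        # (6.2)
    right = np.linalg.pinv(U.T @ P21)
    B = np.stack([U.T @ P3x1[x] @ right for x in range(P3x1.shape[0])])  # (6.3)
    return b1, binf, B

def sequence_probability(b1, binf, B, ys):
    """Whole-sequence probability (6.5): b_inf^T B_{y_t} ... B_{y_1} b_1."""
    state = b1
    for y in ys:                                # apply B_{y_1} first
        state = B[y] @ state
    return float(binf @ state)

def conditional_distribution(binf, B, state):
    """Normalized predictive distribution over the next symbol, as in (6.6)."""
    scores = np.array([binf @ Bx @ state for Bx in B])
    return scores / scores.sum()
```

With exact moments of a rank-k HMM, `sequence_probability` agrees with the forward algorithm; with empirical moments the output is only an estimate and may fall slightly outside [0, 1].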

Based on the same idea, Song et al. [2010] developed a spectral algorithm for learning HMMs
with continuous observations. Instead of operating on probability distributions directly, their
algorithm operates on Hilbert space embeddings of distributions of observable quantities
(assuming stationarity of the HMM):

$$\mu_1 := E_{x_t}[\phi(x_t)], \tag{6.7}$$

$$C_{2,1} := E_{x_{t+1} x_t}[\phi(x_{t+1}) \otimes \phi(x_t)], \tag{6.8}$$

$$C_{3,x,1} := E_{x_{t+2}(x_{t+1}=x)x_t}[\phi(x_{t+2}) \otimes \phi(x_t)] = P(x_{t+1} = x)\, C_{3,1|2}\, \phi(x), \tag{6.9}$$

where $x_t$ denotes the continuous observation vector at time t, $\phi(\cdot)$ maps the real observation
space to a Reproducing Kernel Hilbert Space (RKHS), $\otimes$ denotes the tensor product, and
$C_{3,1|2} := C_{x_{t+2} x_t | x_{t+1}}$ is a conditional embedding operator [Song et al., 2009]. Using these embeddings,
they derived an observable representation of the embedded HMM, which consists of the
following parameters:

$$\beta_1 := \mathcal{U}^\top \mu_1, \tag{6.10}$$

$$\beta_\infty := C_{2,1}(\mathcal{U}^\top C_{2,1})^\dagger, \tag{6.11}$$

$$B_x := (\mathcal{U}^\top C_{3,x,1})(\mathcal{U}^\top C_{2,1})^\dagger, \tag{6.12}$$

where $\mathcal{U}$ is the top-k left singular vectors of $C_{2,1}$. They then showed that the embedding of the
predictive distribution $P(x_t \mid x_1, \ldots, x_{t-1})$ takes the form
$\mu_{x_t | x_1, \ldots, x_{t-1}} = \beta_\infty B_{x_{t-1}} \cdots B_{x_1} \beta_1$
and, as in the case of discrete observations, proposed estimates based on empirical averages
$\hat\mu_1$, $\hat C_{2,1}$, $\hat C_{3,x,1}$, and $\hat{\mathcal{U}}$, which is the top-k left singular vectors of $\hat C_{2,1}$. Using the kernel trick
and techniques from Kernel Principal Component Analysis [Schölkopf et al., 1998], they gave
an estimation procedure that operates solely on finite-dimensional quantities. Moreover, to avoid
the difficulty of partitioning the observation space required by estimation of $B_x$, they proposed
to estimate instead

$$\bar B_x := (\mathcal{U}^\top C_{3,1|2}\, \phi(x))(\mathcal{U}^\top C_{2,1})^\dagger, \tag{6.13}$$

which is only a fixed multiplicative factor P(x) away from $B_x$, and have $\mu_{x_t | x_1, \ldots, x_{t-1}}$ proportional
to $\beta_\infty \bar B_{x_{t-1}} \cdots \bar B_{x_1} \beta_1$. Under some mild conditions, they established the consistency (with high
probability) of their estimator for $\mu_{x_t | x_1, \ldots, x_{t-1}}$ and gave a finite-sample bound on the estimation
error.

In addition to estimation, Song et al. [2010] also discussed possible ways to carry out
prediction. In particular, they showed that in the case of the Gaussian RBF kernel, $\mu_{x_t | x_1, \ldots, x_{t-1}}$ takes
the form of a nonparametric density estimator after proper normalization, and one may choose,
from training data or a pool of samples, the observation with the highest predictive density as the
prediction.


6.2  Spectral  Methods  for  Learning  HMMs  from  Sequence  and 
Non-sequence  Data 

Suppose  in  addition  to  sequence  data,  which  can  be  time  series  of  observations  or  triples  of 
consecutive  observations,  we  also  have  a  set  of  non-sequence  data  points,  which  are  drawn  inde¬ 
pendently  from  the  stationary  distribution  of  the  underlying  HMM.  We  propose  to  improve  the 
estimation  of  the  observable  representation  of  HMMs  by  solving  regularized  least  square  prob¬ 
lems,  which  minimize  a  squared  error  term  on  the  sequence  data  and  a  regularization  term  based 
on  the  non-sequence  data.  As  in  existing  work  on  spectral  learning  of  HMMs,  we  assume  that 
the sequence data are observed after the HMM has fully mixed.

6.2.1 Discrete Observations


Our method has two main steps. We first estimate $P_{2,1}$, and then $b_1$, $b_\infty$, and the $B_x$'s. Let N denote
the number of unique observation symbols. To make use of non-sequence data in estimating $P_{2,1}$,
we note that the marginal of $P_{2,1}$ is the stationary distribution of the discrete HMM. Moreover,
from spectral learning methods we have the assumption of $P_{2,1}$ being low-rank. We thus propose
the following estimator $\tilde P_{2,1}$ defined as

$$\begin{aligned}
\tilde P_{2,1} := \arg\min_P \ & \frac{1}{2}\|W \odot (P - \hat P_{2,1})\|_F^2 + \tau\|P\|_* + \frac{u}{2}\left(\|\hat p - P\mathbf{1}\|_2^2 + \|\hat p - P^\top \mathbf{1}\|_2^2\right) \\
\text{s.t.} \ & \mathbf{1}^\top P \mathbf{1} = 1, \quad P_{ij} \ge 0,
\end{aligned} \tag{6.14}$$

where $\hat p$ is the empirical observation distribution of both the sequence and the non-sequence
data, $\hat P_{2,1}$ is the sequence-only estimate, W is an indicator matrix such that
$W_{ij} = 1 \iff (\hat P_{2,1})_{ij} > 0$, $\odot$ denotes the Hadamard
product, $\|\cdot\|_*$ denotes the matrix nuclear norm, a standard convex relaxation of matrix rank, $\mathbf{1}$ is a
vector of ones, and $u, \tau > 0$ are regularization parameters. The objective in (6.14) minimizes the
squared error from the sequence-only estimate $\hat P_{2,1}$ while penalizing the rank and the deviation
from the marginal $\hat p$. It is easy to see that (6.14) is a convex but non-smooth problem due to the
matrix nuclear norm. Projected sub-gradient descent methods are a common way to solve such
problems, but are known to suffer from slow convergence [Bertsekas, 1999]. We solve (6.14)
by a variant of the smoothing proximal gradient (SPG) method proposed by Chen et al. [2012],
which achieves a provably faster convergence rate than projected sub-gradient methods but has
a similar per-iteration time complexity. In Section 6.2.2 we use SPG to solve the continuous
version of the estimation problem, which has a more general form, and hence describe more
details there.

To set $\tau$ in the right scale, we use the following fact about matrix norms:

$$\|P_{2,1}\|_* / N \le (r/N)\left(\|P_{2,1}\|_\infty \|P_{2,1}\|_1\right)^{1/2}, \tag{6.15}$$

where r is the rank of $P_{2,1}$, and $\|\cdot\|_\infty$ and $\|\cdot\|_1$ denote the matrix $\infty$-norm and 1-norm,
respectively. Assuming stationarity, we have $\|P_{2,1}\|_\infty = \|P_{2,1}\|_1 = \max_i p_i$, where p is the stationary
distribution of observations. Therefore, $P_{2,1}$'s average singular value is $O((r \max_i p_i)/N)$. As
shown by Cai et al. [2010], $\tau$ has an effect of soft-thresholding singular values of $P_{2,1}$, so we let
$\tau = \lambda \max_i \hat p_i / N$ and tune $\lambda$ instead.
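
The norm inequality (6.15) can be checked numerically. The sketch below is only an illustration with a hypothetical function name; it evaluates both sides for a given square matrix, using the fact that the average singular value is the nuclear norm divided by N and that the spectral norm is bounded by the geometric mean of the matrix 1-norm and ∞-norm.

```python
import numpy as np

def norm_bound_sides(P):
    """Return (lhs, rhs) of (6.15) for a square matrix P.

    lhs: average singular value, ||P||_* / N.
    rhs: (r / N) * sqrt(||P||_inf * ||P||_1), with r the rank of P.
    """
    P = np.asarray(P, dtype=float)
    N = P.shape[0]
    r = np.linalg.matrix_rank(P)
    lhs = np.linalg.norm(P, ord='nuc') / N
    # ord=np.inf is the maximum absolute row sum, ord=1 the maximum column sum.
    rhs = (r / N) * np.sqrt(np.linalg.norm(P, ord=np.inf) * np.linalg.norm(P, ord=1))
    return lhs, rhs
```

For a joint probability matrix whose row and column sums both equal the stationary distribution p, the right-hand side reduces to r max_i p_i / N, which is the scale the text uses for the nuclear norm weight.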

We then compute the SVD of our estimate $\tilde P_{2,1}$ from (6.14), denoting its top-k left singular vectors as an N-by-k
matrix $\tilde U$, and obtain estimates of $b_1$ and $b_\infty$ in the same ways as (6.1) and (6.2) using $\tilde P_{2,1}$, $\tilde U$,
and $\hat p$. To derive our estimator of $B_x$, we first note that the original estimator based on (6.3) is
the solution to the following problem:

$$\hat B_x := \arg\min_B \|\hat P_{3,x,1} - \hat U B \hat U^\top \hat P_{2,1}\|_F^2, \tag{6.16}$$

showing that $\hat B_x$ is a low-dimensional representation of $\hat P_{3,x,1}$. As in (6.14), we aim to regularize
the least-square problem (6.16) with non-sequence data. Instead of constructing a regularization
term directly from non-sequence data, we use our new estimator $\tilde P_{2,1}$ based on the fact that
$(\mathbf{1}^\top P_{3,x,1})_j = (P_{2,1})_{xj}$ and $(P_{3,x,1}\mathbf{1})_i = (P_{2,1})_{ix}$, i.e., the marginals of $\{P_{3,x,1}\}$ are equal to $P_{2,1}$.
We thus propose the following estimator $\{\tilde B_x\}$ defined as

$$\begin{aligned}
\arg\min_{\{B_x\}} \ & \frac{1}{2}\sum_x \|W_x \odot (\tilde U B_x V^\top - \hat P_{3,x,1})\|_F^2 \\
& + \frac{u}{2}\sum_{x,i}\left((\tilde P_{2,1})_{ix} - (\tilde U B_x V^\top \mathbf{1})_i\right)^2 + \frac{u}{2}\sum_{x,i}\left((\tilde P_{2,1})_{xi} - (\mathbf{1}^\top \tilde U B_x V^\top)_i\right)^2 \\
\text{s.t.} \ & (\tilde U B_x V^\top)_{ij} \ge 0, \quad \sum_x \mathbf{1}^\top \tilde U B_x V^\top \mathbf{1} = 1,
\end{aligned} \tag{6.17}$$

where $W_x$ is an indicator matrix such that $(W_x)_{ij} > 0 \iff (\hat P_{3,x,1})_{ij} > 0$ and $V := \tilde P_{2,1}^\top \tilde U$.

Note that we not only add regularization terms but also constrain the fitted matrices $\{\tilde U B_x V^\top\}$
to lie on a simplex, mainly to reduce negative values in the predictive distribution (6.6) during
inference. The simplex constraints may incur more bias than desired and may not always be
feasible², but in our experiments we do not observe any negative effect. Later in Section 6.4 we
discuss the possibility of combining the two optimization problems (6.14) and (6.17) into one,
which may fix some of these constraint-related issues but pay the price of a bigger problem size.

Eq. (6.17) is a quadratic program of $k^2 N$ variables under one linear equality constraint
and $N^3$ linear inequality constraints. When N is on the order of a few hundred and k is a
few tens, a reformulation that takes advantage of the block-diagonal structure in the Hessian of
(6.17) can be solved quite efficiently with state-of-the-art optimization software, such as MOSEK
(www.mosek.com). For larger problems, one possible solution is the Alternating Direction
Method of Multipliers [Boyd et al., 2011], which handles constraints by minimizing the original
objective augmented with an iteratively refined constraint violation term. Our experiments in
Section 6.3.1 have N = 100, so we solve (6.17) with MOSEK.

6.2.2  Continuous  Observations 

Our method for continuous observations builds on the Hilbert space embedding approach by
Song et al. [2010], and consists of two main steps: estimating the feature covariance $C_{2,1}$ and
then the observable representation $\beta_1$, $\beta_\infty$, and $B_x$. Let the feature mappings of the sequence
data be organized into three matrices $\Phi_1$, $\Phi_2$, and $\Phi_3$ such that their i-th columns $\Phi_1^i$, $\Phi_2^i$, and
$\Phi_3^i$ are consecutive and going forward in time. By the definition of the feature covariance (6.8),
we have $C_{2,1} := \iint \phi(x) \otimes \phi(y)\, p_{x_{t+1} x_t}(x, y)\, dx\, dy$. If we have a set of feature points grouped
column-wise as a feature matrix $\Phi$, and know exactly which pairs of points are consecutive
in time via a (normalized) temporal adjacency matrix $T_{2,1}$, we may then compute the quantity
$\Phi T_{2,1} \Phi^\top$ as an unbiased estimator of $C_{2,1}$. It is easy to see that the sequence-only estimate $\hat C_{2,1}$,
which is $\Phi_2 \Phi_1^\top$ divided by the number of consecutive pairs, is one special
case of such an estimator. To incorporate non-sequence data into our estimation procedure, we
denote its feature matrix by Z and consider another special case:

² When this happens, one may choose the smallest k that makes the constraints feasible, and then solve (6.17).


$$\tilde C_{2,1} := Z_2 P Z_1^\top, \tag{6.18}$$

where $Z_1 := [\Phi_1 \ Z]$ and $Z_2 := [\Phi_2 \ Z]$. It then suffices to estimate P subject to
$\mathbf{1}^\top P \mathbf{1} = 1$ and $P_{ij} \ge 0$.
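
To see how the weight matrix P in (6.18) contains the sequence-only estimator as a special case, consider the sketch below. It is illustrative only: it assumes explicit finite-dimensional feature columns (rather than an RKHS), and the function name is hypothetical. Placing uniform weight on the diagonal of the sequence block pairs each sequence point with its successor and ignores the non-sequence columns.

```python
import numpy as np

def sequence_only_weights(n_seq, n_nonseq):
    """Weight matrix P for which Z2 @ P @ Z1.T equals Phi2 @ Phi1.T / n_seq.

    Rows index columns of Z2 = [Phi2, Z]; columns index those of Z1 = [Phi1, Z].
    """
    m = n_seq + n_nonseq
    P = np.zeros((m, m))
    # Uniform mass on matched consecutive pairs (i-th column of Phi2 with
    # i-th column of Phi1); non-sequence points receive zero weight.
    P[:n_seq, :n_seq] = np.eye(n_seq) / n_seq
    return P
```

This P is feasible for the constraints above (non-negative entries summing to one); the proposed estimator instead searches over all feasible P, so mass may also be placed on pairs involving non-sequence points.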

Similar to Section 6.2.1, our estimation objective consists of three terms: the squared error
between $\tilde C_{2,1}$ in (6.18) and the sequence-only estimate $\hat C_{2,1}$, penalization on $\tilde C_{2,1}$'s rank, and deviation of $\tilde C_{2,1}$'s marginal from the
mean of the stationary distribution. The last term is based on the fact that, under the assumption
of stationarity, $C_{2,1} f = E[\phi(X)]$ holds for some constant function f in $\mathcal{G}$ such that
$f(x) = f^\top \phi(x) = 1 \ \forall x$. Formally, our estimator P is the solution to the following convex program:

$$\begin{aligned}
\min_P \ & \frac{1}{2}\|Z_2 P Z_1^\top - \hat C_{2,1}\|_{\mathcal{G} \otimes \mathcal{G}}^2 + \tau\|Z_2 P Z_1^\top\|_* + \frac{u}{2}\left(\left\|Z_2 P \mathbf{1} - \frac{S\mathbf{1}}{m_S}\right\|_{\mathcal{G}}^2 + \left\|Z_1 P^\top \mathbf{1} - \frac{S\mathbf{1}}{m_S}\right\|_{\mathcal{G}}^2\right) \\
\text{s.t.} \ & \mathbf{1}^\top P \mathbf{1} = 1, \quad P_{ij} \ge 0,
\end{aligned} \tag{6.19}$$

where we introduce S and $m_S$ to denote the feature matrix and the size of the entire set of
non-sequence data and let Z denote a sub-sample of it, mainly to limit the number of variables when
the non-sequence dataset is very large. As shown in Appendix C.1, using the kernel trick and
properties of the matrix trace and nuclear norm, we re-write the objective function in (6.19) as
follows (dropping constants):

$$\frac{1}{2}\mathrm{Tr}(P^\top M_2 P M_1) - \mathrm{Tr}(P^\top F) + \tau\|L_2^\top P L_1\|_* + \frac{u}{2}\mathbf{1}^\top(P^\top M_2 P + P M_1 P^\top)\mathbf{1} - u\,\mathbf{1}^\top(P^\top \mu_2 + P \mu_1), \tag{6.20}$$

where $\mathrm{Tr}(\cdot)$ is the matrix trace, $M_i := Z_i^\top Z_i$, $\mu_i := Z_i^\top S\mathbf{1}/m_S$, $F := Z_2^\top \hat C_{2,1} Z_1$, and $L_i$ is a finite
matrix such that $M_i = L_i L_i^\top$. To set $\tau$ in a proper scale, we use an inequality similar to (6.15)
to upper-bound the average singular value of $L_2^\top P L_1$, and then replace the unknown P by the
uniform distribution to have $\tau := (\lambda/m^3)\left(\|L_2^\top \mathbf{1}\mathbf{1}^\top L_1\|_\infty \|L_2^\top \mathbf{1}\mathbf{1}^\top L_1\|_1\right)^{1/2}$, where m is the size
of P and $\lambda > 0$ takes values in some reasonable range.

As mentioned in Section 6.2.1, we solve (6.19) with a variant of the smoothing proximal
gradient (SPG) method outlined in Algorithm 6.1, which minimizes $f_\mu(P)$, a smooth approximation
of (6.20) that approximates the non-smooth regularization $\tau\|L_2^\top P L_1\|_*$ by

$$g_\mu(P) := \max_{\|Y\|_2 \le 1} \tau\,\mathrm{Tr}(Y^\top L_2^\top P L_1) - \frac{\mu}{2}\|Y\|_F^2, \tag{6.21}$$


where $\mu > 0$ is a smoothing parameter, and $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the matrix spectral and Frobenius
norms, respectively. Nesterov [2005] shows that (6.21) is continuously differentiable in P
and $\nabla g_\mu(P) = \tau L_2 Y^* L_1^\top$, where $Y^*$ is the optimal solution to (6.21) obtained by projecting
$(\tau/\mu) L_2^\top P L_1$ to the unit spectral-norm ball, i.e., truncating its singular values at 1.

Algorithm 6.1 Smoothing Proximal Gradient for (6.19)
  Initialize $Y^{(0)} = P^{(0)}$ to some feasible point.
  Set $t := 0$, $\theta^{(0)} := 1$, $\eta := 10$, and $\gamma^{(0)} := 1$.
  repeat
    Find the smallest $k \in \{0, 1, \ldots\}$ that satisfies
      $$f_\mu(P^{(t+1)}) - f_\mu(Y^{(t)}) \le \frac{\gamma^{(t+1)}}{2}\|P^{(t+1)} - Y^{(t)}\|_F^2 + \mathrm{Tr}\left((P^{(t+1)} - Y^{(t)})^\top \nabla f_\mu(Y^{(t)})\right),$$
    where $\gamma^{(t+1)} := \eta^k \gamma^{(t)}$ and
      $$P^{(t+1)} := \arg\min_P \|Y^{(t)} - \nabla f_\mu(Y^{(t)})/\gamma^{(t+1)} - P\|_F^2 \quad \text{s.t. } P_{ij} \ge 0,\ \mathbf{1}^\top P \mathbf{1} = 1. \tag{6.22}$$
    $\theta^{(t+1)} := \left(1 + \sqrt{1 + 4(\theta^{(t)})^2}\right)/2$.
    $Y^{(t+1)} := P^{(t+1)} + \frac{\theta^{(t)} - 1}{\theta^{(t+1)}}\left(P^{(t+1)} - P^{(t)}\right)$.
    $t := t + 1$.
  until convergence or $t = T_{\max}$, an iteration limit.

The update (6.22) for $P^{(t+1)}$ requires projection onto a simplex, for which several efficient algorithms exist,
such as the sorting-based method proposed by Duchi et al. [2008]. The convergence theory of
Chen et al. [2012] suggests setting³ $\mu = \epsilon/m$, m being the column dimension of $Z_2$, so that the
objective values (6.20) converge in $O(1/\epsilon^2)$ iterations to at most $\epsilon$ plus the minimum.
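
Two building blocks of Algorithm 6.1 can be sketched in a few lines: the gradient of the smoothed nuclear norm term (6.21), obtained by truncating singular values at 1, and the simplex projection needed in (6.22). This is a minimal sketch, not the implementation used in the experiments; the function names are hypothetical, and the projection follows the standard sorting-based scheme attributed in the text to Duchi et al. [2008], applied to the matrix entries as one long vector.

```python
import numpy as np

def smoothed_nuclear_grad(P, L1, L2, tau, mu):
    """Gradient tau * L2 @ Y* @ L1.T of the smoothed nuclear norm term."""
    X = (tau / mu) * (L2.T @ P @ L1)
    W, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Projection onto the unit spectral-norm ball: clip singular values at 1.
    Y_star = (W * np.minimum(s, 1.0)) @ Vt
    return tau * (L2 @ Y_star @ L1.T)

def project_to_simplex(V):
    """Euclidean projection of a matrix onto {P : P_ij >= 0, sum(P) = 1}."""
    v = np.asarray(V, dtype=float).ravel()
    u = np.sort(v)[::-1]                       # entries in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]  # last index kept positive
    theta = css[rho] / (rho + 1)                # common shift
    return np.maximum(v - theta, 0.0).reshape(np.shape(V))
```

Inside the loop of Algorithm 6.1, the candidate $P^{(t+1)}$ would be `project_to_simplex(Y - grad / gamma)`, with `grad` the gradient of the full smooth objective, of which `smoothed_nuclear_grad` is only one term.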

We then compute the top k left singular vectors of $\tilde C_{2,1}$ in a similar way to Kernel Principal
Component Analysis [Schölkopf et al., 1998], starting with the fact that any left singular vector
of $\tilde C_{2,1} = Z_2 P Z_1^\top$ can be expressed as $Z_2 \alpha$ for some $\alpha \in \mathbb{R}^m$, and any left singular vector of
$\tilde C_{2,1}$ is an eigenvector of $\tilde C_{2,1} \tilde C_{2,1}^\top$ and vice versa. Thus we have

$$\begin{aligned}
Z_2 P M_1 P^\top M_2 \alpha = Z_2 P Z_1^\top Z_1 P^\top Z_2^\top (Z_2 \alpha) &= \omega Z_2 \alpha \\
\Longrightarrow \quad M_2 P M_1 P^\top M_2 \alpha &= \omega M_2 \alpha,
\end{aligned} \tag{6.23}$$

which is a generalized eigensystem. Let $\Omega$ denote the diagonal matrix formed by the top k
generalized eigenvalues of (6.23), and A denote the column concatenation of the corresponding
generalized eigenvectors. It is then clear that $D := (A^\top M_2 A)^{-1/2}$ is diagonal, and we obtain a
concise form of $\tilde C_{2,1}$'s top k left singular vectors $\mathcal{U} = Z_2 A D$. We also have the following useful
identity:

$$M_2 P M_1 P^\top M_2 A = M_2 A \Omega. \tag{6.24}$$
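
A minimal sketch of solving the generalized eigensystem (6.23) with NumPy alone follows. It assumes $M_2$ is positive definite so that a Cholesky factor exists; the function name is hypothetical, and the symmetrizing substitution is one possible route rather than necessarily the one used in the thesis.

```python
import numpy as np

def top_singular_coefficients(P, M1, M2, k):
    """Solve M2 P M1 P^T M2 a = w M2 a for the top-k eigenpairs.

    Returns A (m x k), diagonal D (k x k), and omega (k,), such that the
    top-k left singular vectors of Z2 P Z1^T are Z2 @ A @ D.
    """
    L = np.linalg.cholesky(M2)                  # M2 = L L^T
    # With b = L^T a, the system becomes the symmetric problem S b = w b.
    S = L.T @ P @ M1 @ P.T @ L
    w, Bvec = np.linalg.eigh(S)
    top = np.argsort(w)[::-1][:k]               # indices of largest eigenvalues
    omega = w[top]
    A = np.linalg.solve(L.T, Bvec[:, top])      # a = L^{-T} b
    D = np.diag(1.0 / np.sqrt(np.diag(A.T @ M2 @ A)))
    return A, D, omega
```

With this normalization $A^\top M_2 A$ is already the identity up to rounding, so D is numerically the identity; it is kept to mirror the definition of D in the text.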

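³ For solving (6.14) we set $\mu = \epsilon/N$.

Next we describe our estimators for the observable representation. First we have

$$\tilde\beta_1 := \mathcal{U}^\top S\mathbf{1}/m_S = D A^\top \mu_2, \tag{6.25}$$

$$\tilde\beta_\infty := \tilde C_{2,1}(\mathcal{U}^\top \tilde C_{2,1})^\dagger = Z_2 P M_1 P^\top M_2 A D \Omega^{-1} \tag{6.26}$$

by using the identity $(\mathcal{U}^\top \tilde C_{2,1})^\dagger = Z_1 P^\top M_2 A D \Omega^{-1}$ established from properties of the
pseudo-inverse, (6.24), and the definition of D. To derive our estimator for $\bar B_x$ defined in (6.13), we start
from the conditional covariance operator defined by Song et al. [2009]

$$C_{3,1|2}\,\phi(x) := C_{3,1,2}\, C_{2,2}^{-1}\, \phi(x), \quad \text{where}$$

$$C_{3,1,2} := E_{X_{t+2} X_t X_{t+1}}[\phi(X_{t+2}) \otimes \phi(X_t) \otimes \phi(X_{t+1})],$$

$$C_{2,2} := E_{X_{t+1}}[\phi(X_{t+1}) \otimes \phi(X_{t+1})].$$

Using a similar idea to (6.18), we encode the empirical distribution of triples of consecutive
observations by a third-order tensor Q and have the following estimator

$$\tilde C_{3,1|2}\,\phi(x) := \Big(\sum_{i,j,l} Q_{ijl}\, Z_3^i \otimes Z_1^j \otimes Z_2^l\Big)\Big(\frac{1}{m} Z_2 Z_2^\top + \nu I\Big)^{-1}\phi(x),$$

where $Z_3 := [\Phi_3 \ Z]$, $\nu > 0$ is a regularization parameter, and superscripts denote column
indices. We then define our estimator for $\bar B_x$ as

$$\tilde B_x := \left(\mathcal{U}^\top(\tilde C_{3,1|2}\,\phi(x))\right)(\mathcal{U}^\top \tilde C_{2,1})^\dagger \tag{6.27}$$

$$= m \sum_{l=1}^{m}\left[(M_2 + \nu m I)^{-1} Z_2^\top \phi(x)\right]_l B_l, \tag{6.28}$$

where $B_l \in \mathbb{R}^{k \times k}$ is a linear transformation of $Q_{\cdot\cdot l} \in \mathbb{R}^{m \times m}$, the l-th slice of Q along the third
dimension:

$$B_l := \mathcal{U}^\top Z_3\, Q_{\cdot\cdot l}\, Z_1^\top (\mathcal{U}^\top \tilde C_{2,1})^\dagger. \tag{6.29}$$

Note that in the usual setting of learning from dynamic data, the third-order tensor Q is diagonal
and $B_l$ becomes a rank-one matrix, so (6.28) reduces to the estimator proposed by Song et al.
[2010].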

The definitions above naturally lead to an estimation procedure that first estimates Q and
then applies (6.29) to estimate $B_l$. However, such a procedure involves $m^3$ variables when the
quantities of interest consist of only $k^2 m$ variables. We thus propose to estimate the $B_l$'s directly.
Viewing (6.29) as the solution to

$$\arg\min_{B_l} \|Q_{\cdot\cdot l} - \bar U B_l \bar V^\top\|_F^2, \quad \text{where}$$

$$\bar U := (\mathcal{U}^\top Z_3)^\dagger = (D A^\top Z_2^\top Z_3)^\dagger = (D A^\top M_{23})^\dagger,$$

$$\bar V^\top := \left(Z_1^\top (\mathcal{U}^\top \tilde C_{2,1})^\dagger\right)^\dagger = (M_1 P^\top M_2 A D \Omega^{-1})^\dagger,$$

we propose to estimate the $B_l$'s by the following:

$$\arg\min_{\{B_l\}} \frac{1}{2}\|C_{3,1,2}(\{B_l\}) - \hat C_{3,1,2}\|_{\mathcal{G} \otimes \mathcal{G} \otimes \mathcal{G}}^2 + \frac{u}{2}\left(\|C_{3,\cdot,2}(\{B_l\}) - \tilde C_{2,1}\|_{\mathcal{G} \otimes \mathcal{G}}^2 + \|C_{\cdot,1,2}(\{B_l\})^\top - \tilde C_{2,1}\|_{\mathcal{G} \otimes \mathcal{G}}^2\right), \tag{6.30}$$

in which

$$C_{3,1,2}(\{B_l\}) := \sum_{i,j,l} (\bar U B_l \bar V^\top)_{ij}\, Z_3^i \otimes Z_1^j \otimes Z_2^l, \tag{6.31}$$

$$C_{3,\cdot,2}(\{B_l\}) := \sum_{i,j,l} (\bar U B_l \bar V^\top)_{ij}\, (f^\top Z_1^j)\, Z_3^i \otimes Z_2^l, \tag{6.32}$$

$$C_{\cdot,1,2}(\{B_l\}) := \sum_{i,j,l} (\bar U B_l \bar V^\top)_{ij}\, (f^\top Z_3^i)\, Z_1^j \otimes Z_2^l. \tag{6.33}$$

Again, our estimation objective consists of a squared error term on the observed tri-variance and
two regularization terms on the deviation of the marginals $C_{3,\cdot,2}$ and $C_{\cdot,1,2}$ from our estimated
covariance $\tilde C_{2,1}$. As shown in Appendix C.2, we use kernel tricks to re-write the objective function
(6.30) in terms of finite-dimensional quantities. Moreover, by re-defining the notation B to be a
$k^2$-by-m matrix whose l-th column denotes the column concatenation of the k-by-k matrix $B_l$,
we obtain the following succinct form of (6.30) (dropping constants):

$$\min_B \ \frac{1}{2}\mathrm{Tr}(B^\top C B M_2) - \mathrm{Tr}(J^\top B) \tag{6.34}$$

with an analytical solution $C^{-1} J M_2^{-1}$, where C and J are defined⁴ in Appendix C.2.
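
The closed-form minimizer of (6.34) follows from setting the gradient $C B M_2 - J$ to zero. A minimal numerical sketch is given below, assuming C and $M_2$ are symmetric positive definite (consistent with the invertibility noted in the footnote); the function name is hypothetical, and the actual C and J come from Appendix C.2, which is not reproduced here.

```python
import numpy as np

def closed_form_minimizer(C, J, M2):
    """Minimizer of 0.5 * Tr(B^T C B M2) - Tr(J^T B): B = C^{-1} J M2^{-1}.

    Uses linear solves instead of explicit matrix inverses.
    """
    X = np.linalg.solve(C, J)               # X = C^{-1} J
    # Solve B M2 = X, i.e. M2^T B^T = X^T.
    return np.linalg.solve(M2.T, X.T).T
```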

6.3  Experiments 

We  compare  our  proposed  methods  with  the  original  spectral  algorithms  (Section  6.1)  that  only 
use  sequence  data.  In  the  case  of  discrete  observations  we  conduct  a  simulation  study,  and  we 
apply  the  algorithms  for  continuous  observations  to  an  activity  monitoring  dataset. 

6.3.1  Simulation 

We  create  a  discrete  HMM  with  20  states  and  100  observation  symbols.  The  state  transition 
probability  matrix  is  of  rank  nearly  7.  The  heatmaps  of  the  state  transition  probability  and  the 
state-conditioned  observation  probability  matrices  are  in  Figures  6.1(a)  and  6.1(b).  From  this 
HMM  we  generate  50  datasets,  each  containing  a  training  sequence  of  length  1000  initialized 
from  the  stationary  distribution  as  the  sequence  data,  a  set  of  10000  observations  independently 
drawn  from  the  stationary  distribution  as  the  non-sequence  data,  and  a  testing  sequence  of  length 
1000, also initialized from the stationary distribution. We set the dimension $k = 7$, and for the
proposed estimate set $\nu = 100$ and $\lambda = 15$. We then perform filtering and prediction along the
testing  sequence.  To  give  bounds  on  the  prediction  performance,  we  also  give  prediction  results 
by  the  true  observable  representation  and  the  stationary  distribution. 

4When  the  kernel  is  positive  definite,  it  is  easy  to  verify  that  both  C  and  M2  are  invertible. 
Figure 6.1: Discrete HMM model parameters


Table  6.1:  Paired  t  test  results.  Each  cell  shows  the  number  of  testing  time  points  at  which 
the  row  method  outperforms  the  column  method  statistically  significantly.  The  total  number  of 
testing  time  points  is  999. 


                 true   proposed   sequence-only   stationary
true                -        827             999          975
proposed            0          -             999          470
sequence-only       0          0               -            0
stationary          0          0             999            -

Figure 6.2 shows the median testing log-likelihood over 50 experiments at each testing time point. The proposed estimator outperforms the sequence-only estimator at most time points. For each pair among the four predictions, we performed paired t-tests of their testing likelihoods at all time points, and counted the number of time points at which one prediction outperforms the other with statistical significance. The results are in Table 6.1. The proposed estimator predicts better than the sequence-only estimator at all time points and better than the stationary distribution at many time points, whereas these two methods never predict significantly better than the proposed method. It is surprising that the sequence-only estimator performs even worse than the stationary distribution. As pointed out by Siddiqi et al. [2010], the filtering and prediction steps (6.6) do not guarantee non-negativity of the probability estimates, especially when, as in the current experiment, there is little dynamic data. Indeed, we observe quite a few negative values in the sequence-only estimates and replace them with $10^{-12}$. This is an indication of unreliable estimates leading to poor prediction. In contrast, the proposed estimates almost always take non-negative values.
Figure 6.2: Median testing log-likelihood. The y-axis lower limit is set to -6 for better visualization; the red dashed line actually takes values as small as -17.

6.3.2  IMU  Measurements  of  Human  Activities 

The PAMAP2 physical activity monitoring dataset [Reiss and Stricker, 2012] contains recordings
of  18  different  physical  activities  performed  by  9  subjects  wearing  3  inertial  measurement  units 
(IMUs)  and  a  heart-rate  monitor.  Each  subject  follows  a  protocol  to  perform  a  sequence  of 
activities  with  breaks  in  between.  For  our  experiment  we  use  data  collected  from  subject  101 
while  walking  and  running.  We  focus  our  experiment  on  recordings  from  the  three  IMUs,  and 
for each IMU only use the 3D-acceleration data (m/s^2) with scale ±16g, as recommended by
the authors, and the 3D-gyroscope data (rad/s), resulting in an observation space of 6 × 3 = 18
dimensions.  Subject  101  performs  walking  and  running  for  approximately  3.5  minutes  each,  and 
we  discard  the  first  and  the  last  10  seconds  of  data  to  remove  transitioning  between  activities.  To 
make  the  experiment  more  interesting,  we  break  the  IMU  recordings  into  short  segments  of  10 
seconds  each  and  interleave  the  walking  segments  with  the  running  ones  to  generate  a  sequence 
of  alternating  activities.  The  IMUs  operate  at  a  sampling  frequency  of  100Hz,  so  each  segment 
has  1000  data  points  and  the  entire  sequence  has  39265  data  points.  We  normalize  each  of  the  18 
dimensions  to  be  zero-mean  and  standard  deviation  1.  Figure  6.3  shows  one  of  the  dimensions 
from  the  first  2000  data  points,  revealing  significant  differences  between  walking  and  running. 
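The preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, trailing partial segments are simply kept, and the population standard deviation is used for normalization; the source does not specify either choice.

```python
from statistics import mean, pstdev


def interleave_segments(walk, run, seg_len):
    """Cut two recordings into segments of seg_len samples and alternate them.

    Each recording is a list of observation rows (lists of floats).
    Assumption: a trailing partial segment is kept rather than dropped.
    """
    walk_segs = [walk[i:i + seg_len] for i in range(0, len(walk), seg_len)]
    run_segs = [run[i:i + seg_len] for i in range(0, len(run), seg_len)]
    out = []
    for k in range(max(len(walk_segs), len(run_segs))):
        if k < len(walk_segs):
            out.extend(walk_segs[k])
        if k < len(run_segs):
            out.extend(run_segs[k])
    return out


def standardize_columns(rows):
    """Rescale every dimension to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in rows]
```

With a 100 Hz sampling rate and 10-second segments, `seg_len` would be 1000.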

Figure 6.3: First-axis acceleration from the wrist IMU

Figure 6.4: Prediction performance on the IMU data. The black-dashed line is obtained by using n = 5000 dynamic data points, serving as a performance limit. Panel (a): boxplots of median prediction errors for n = 25, 50, 100, 200.

We take the last 4256 data points as the testing sequence, and generate 10 training datasets as follows. We randomly sample $n$ triples of consecutive observations from the first 35000 data points as the sequence data, and another non-overlapping set of $m + m_s$ single observations as the non-sequence data, in which $m$ points are used to form $Z$ and the remaining $m_s$ points constitute $S$ in the proposed algorithm. The values of $n$, $m$, and $m_s$ are: $n \in \{25, 50, 100, 200\}$, $m \in \{500, 1000\}$, and $m_s = 4000$. We use the Gaussian RBF kernel $k(x, x') := \exp(-\|x - x'\|^2/\sigma^2)$, and set $\sigma^2$ to be half of the median squared pairwise distance of the sequence data. The dimension $k$, i.e., the number of top left singular vectors, is set to 20 for $n = 25$ and 50 for the rest. The proposed algorithm has three regularization parameters: two in (6.19), one of which is $\lambda$, and one in (6.34). We determine these parameters by minimizing 5-fold$^5$ cross validation error on the sequence data over a cube of $\log_2$ values, with $\log_2 \lambda$ as the second coordinate: $\{-8, -6, \ldots, 6\} \times \{-9, -7, \ldots, 1\} \times \{-5, -3, \ldots, 9\}$.
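The bandwidth rule stated above (half of the median squared pairwise distance) can be sketched as below. The function names are hypothetical, and the kernel is written with a negative exponent, as is standard for the Gaussian RBF kernel.

```python
import math
from itertools import combinations
from statistics import median


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def median_heuristic_bandwidth(points):
    """Return sigma^2 as half of the median squared pairwise distance."""
    d2 = [squared_distance(x, y) for x, y in combinations(points, 2)]
    return 0.5 * median(d2)


def rbf_kernel(x, y, sigma2):
    """Gaussian RBF kernel exp(-||x - y||^2 / sigma^2)."""
    return math.exp(-squared_distance(x, y) / sigma2)
```

For a few thousand points the quadratic number of pairs is still manageable; for larger sets one would typically compute the median on a random subsample.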

After learning the model parameters, we perform filtering and prediction along the testing sequence. As mentioned in Section 6.1, the Hilbert space embedding of the predictive distribution takes the form of a non-parametric density estimator thanks to the Gaussian RBF kernel, and we predict the next observation by selecting from $S$, the $m_s$ static data points, the one with the highest predictive density. For each predicted observation we compute the squared error against the true observation, and for each predicted sequence we take the median and the mean of the squared prediction errors as sequence-wise performance indicators. Figure 6.4(a) gives the boxplot of the 10 median prediction errors, showing that the proposed method of incorporating static data improves the prediction performance more significantly when the sequence data size $n$ is small. Figure 6.4(b) gives the boxplot of the 10 means, demonstrating a similar trend of improvement except when $n = 50$. Looking more into that result, we find that it is the running part of the testing sequence that the proposed method fails to predict better, possibly due to the more extreme values and changes in its IMU readings, as shown in Figure 6.3.

5 We only split the sequence data but not the static data.


6.4  Discussions  and  Conclusions 

We propose spectral learning algorithms for HMMs that incorporate static data as regularization. Experiments on synthetic and real human activities data demonstrate a clear advantage of using static data when sequence data is limited. There are several interesting directions for future work, including deriving theoretical guarantees for the proposed methods and solving real problems where sequence data is much more difficult to obtain than non-sequence data. In terms of methodology, a possible improvement is to combine the two stages in the proposed methods into one optimization problem, where the optimization variable is a three-way tensor representing the joint probability of observation triples, and the objective takes a similar form of an error term on sequence data plus regularization terms based on non-sequence data. Given an estimate for the three-way probability tensor, lower-order probabilities can be easily obtained by marginalization, and then the spectral learning algorithms in Section 6.1 can be applied. One advantage of such a procedure is that the estimates of the probability matrix and tensor are inherently consistent, and therefore the sub-spaces computed by spectral decomposition are optimal with respect to both, whereas in the proposed two-stage methods, the sub-spaces are optimal with respect to only the estimated joint probability matrix. The downside is the optimization in the space of three-way tensors, which is computationally intensive in terms of both time and storage.

Although  not  explicitly  described  in  this  chapter,  it  is  possible  to  extend  the  regular  sequence- 
based  EM  learning  algorithm  for  HMMs  to  make  use  of  non-sequence  data  drawn  from  the 
stationary  distribution.  More  specifically,  such  non-sequence  data  can  be  easily  incorporated 
into  the  EM  estimation  procedure  for  parameters  in  the  observation  model,  e.g.,  the  state-specific 
mean  observation  vectors  and  noise  covariances  in  a  Gaussian  observation  model,  because  these 
parameters  are  time-independent.  However,  as  with  the  regular  EM  approach,  finding  a  good 
local  optimum  is  always  an  issue  and  may  require  a  lot  of  tuning. 
Chapter 7

Learning  Bi-clustered  Vector 
Auto-regressive  Model 


In this chapter we return to the usual setting of learning from sequence data, and consider learning structured Vector Auto-regressive (VAR) models. Although not directly related to the main theme of the thesis, the methods developed here, as we explain later, can benefit learning from non-sequence data. Our motivation is from the use of VARs for analyzing the temporal dependencies in multivariate time series data, known as Granger causality$^1$ [Granger, 1969]. For example, researchers in computational biology, using ideas from sparse linear regression, recently developed sparse estimation techniques for VAR models [Fujita et al., 2007; Lozano et al., 2009; Shojaie et al., 2011] to learn from high-dimensional genomic time series a small set of pairwise, directed interactions, referred to as gene regulatory networks, some of which lead to novel biological hypotheses.

While individual edges convey important information about interactions, it is often desirable to obtain an aggregate and more interpretable description of the network of interest. One useful set of tools for this purpose is graph clustering methods [Schaeffer, 2007], which identify groups of nodes or vertices that have similar types of connections, such as a common set of neighboring nodes in undirected graphs, and shared parent or child nodes in directed graphs. These methods have been applied in the analysis of various types of networks, such as [Girvan and Newman, 2002], and play a key role in graph visualization tools [Herman et al., 2000].

Motivated  by  the  wide  applicability  of  the  above  two  threads  of  work  and  the  observation  that 
their  goals  are  tightly  coupled,  we  develop  a  methodology  that  integrates  both  types  of  analyses, 
estimating the underlying Granger causal network and its clustering structure simultaneously.
One can imagine that such a structure, once estimated, could be used as prior knowledge for other
learning  tasks  in  the  same  domain,  and  as  suggested  in  Chapter  3,  such  prior  knowledge  may  aid 
learning  VARs  from  non-sequence  data  by  providing  better  regularization  of  the  model. 

In this chapter we use the following notation for a first-order $p$-dimensional VAR model:

$$x_{(t)} = x_{(t-1)} A + \epsilon_{(t)}, \quad \epsilon_{(t)} \sim \mathcal{N}(0, \sigma^2 I), \qquad (7.1)$$

1  More  precisely,  graphical  Granger  causality  for  more  than  two  time  series. 
where $x_{(t)} \in \mathbb{R}^{1 \times p}$ denotes the vector of variables observed at time $t$, $A \in \mathbb{R}^{p \times p}$ is known as the transition matrix, whose non-zero entries encode Granger-causal relations among the variables, and the $\epsilon_{(t)}$'s denote independent noise vectors drawn from a zero-mean Gaussian with a spherical covariance $\sigma^2 I$. Our goal is to obtain a transition matrix estimate $\hat{A}$ that is both sparse, leading directly to a Granger-causal network, and clustered so that variables sharing a similar set of connections are grouped together. Since the rows and the columns of $A$ indicate different roles of the variables, the former revealing how variables affect the others and the latter showing how variables get affected, we consider the more general bi-clustering setting, which allows two different sets of clusters for rows and columns, respectively. We take a nonparametric Bayesian approach, placing over $A$ a nonparametric bi-clustered prior and carrying out full posterior inferences via a blocked Gibbs sampling scheme. Our simulation study demonstrates that when the underlying VAR model exhibits a clear bi-clustering structure, our proposed method improves over some natural alternatives, such as adaptive sparse learning methods [Zou, 2006] followed by bi-clustering, in terms of model estimation accuracy, clustering quality, and forecasting capability. More encouragingly, on a real-world T-cell activation gene expression time series data set [Rangel et al., 2004] our proposed method finds an interesting bi-clustering structure, which leads to a biologically more meaningful interpretation than those by some state-of-the-art time series clustering methods.
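To make the row-vector convention in (7.1) concrete, the following is a minimal simulation sketch. The function name and arguments are hypothetical; the only modelling content is the recursion of (7.1) with spherical Gaussian noise.

```python
import random


def simulate_var1(A, x0, steps, sigma, rng):
    """Simulate x(t) = x(t-1) A + eps(t), treating each x(t) as a row vector.

    A is a p-by-p list of lists; entry A[i][j] is the effect of variable i
    at time t-1 on variable j at time t. sigma is the noise standard deviation.
    """
    p = len(A)
    path = [list(x0)]
    for _ in range(steps):
        prev = path[-1]
        nxt = [
            sum(prev[i] * A[i][j] for i in range(p)) + rng.gauss(0.0, sigma)
            for j in range(p)
        ]
        path.append(nxt)
    return path
```

Under this convention a non-zero `A[i][j]` corresponds to a directed edge from variable `i` to variable `j` in the Granger-causal network, so row `i` collects the outgoing edges of variable `i` and column `j` the incoming edges of variable `j`.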

Before  introducing  our  method,  we  briefly  discuss  related  work  in  Section  7.1.  Then  we 
define our bi-clustered prior in Section 7.2, followed by our sampling scheme for posterior inferences
in Section 7.3. Lastly, we report our experimental results in Section 7.4 and conclude with
Section  7.5. 


7.1  Related  work 

There has been a lot of work on sparse estimation of Granger-causal networks under VAR models, and perhaps even more on graph clustering. However, to the best of our knowledge, none of it has considered the simultaneous learning scheme we propose here. Some of the more recent sparse VAR estimation work [Lozano et al., 2009; Shojaie et al., 2011] takes into account dependency further back in time and can even select the right length of history, known as the order of the VAR model. While focusing on first-order VAR models, we observe that it is possible to extend our method to learn higher-order bi-clustered VAR models, where the same bi-clustering structure is shared by all the time-lagged transition matrices, an extension to the grouped graphical Granger modeling approach of Lozano et al. [2009].

Another  large  body  of  related  work  [e.g.,  Busygin  et  al.,  2008;  Meeds  and  Roweis,  2007; 
Porteous et al., 2008] concerns bi-clustering (or co-clustering) a data matrix, which usually consists
of relations between two sets of objects, such as user ratings on items, or word occurrences
in  documents.  Most  of  this  work  models  data  matrix  entries  by  mixtures  of  distributions  with 
different  means,  representing,  for  example,  different  mean  ratings  by  different  user  groups  on 
item  groups.  In  contrast,  common  regularization  schemes  or  prior  beliefs  for  VAR  estimation 
usually  assume  zero-mean  entries  for  the  transition  matrix,  biasing  the  final  estimate  towards 
being  stable.  Following  such  a  practice,  our  method  models  transition  matrix  entries  as  scale 
mixtures  of  zero-mean  distributions. 
Finally, clustering time series data has been an active research topic in a number of areas, in particular computational biology. However, unlike our Granger causality based bi-clustering method, most of the existing work, such as [Cooke et al., 2011; Ramoni et al., 2002] and the references therein, focuses on grouping together similar time series, with a wide range of similarity measures from simple linear correlation to complicated Gaussian process based likelihood scores. Differences between our method and existing similarity-based approaches are demonstrated in Section 7.4 through both simulations and experiments on real data.


7.2  Bi-clustered  prior 


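We treat the transition matrix $A \in \mathbb{R}^{p \times p}$ as a random variable and place over it a “bi-clustered” prior, as defined by the following generative process:

$$\begin{aligned}
\pi_u &\sim \text{Stick-Break}(\alpha_u), & \pi_v &\sim \text{Stick-Break}(\alpha_v), \\
\{u_i\}_{1 \le i \le p} &\overset{\text{i.i.d.}}{\sim} \text{Multinomial}(\pi_u), & \{v_j\}_{1 \le j \le p} &\overset{\text{i.i.d.}}{\sim} \text{Multinomial}(\pi_v), \\
\{\lambda_{kl}\}_{1 \le k, l \le \infty} &\overset{\text{i.i.d.}}{\sim} \text{Gamma}(h, c), & & \qquad (7.2) \\
A_{ij} &\sim \text{Laplace}(0, 1/\lambda_{u_i v_j}), & 1 \le i, j &\le p. \qquad (7.3)
\end{aligned}$$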

The process starts by drawing row and column mixture proportions $\pi_u$ and $\pi_v$ from the “stick-breaking” distribution [Sethuraman, 1994], denoted by Stick-Break($\alpha$) and defined on an infinite-dimensional simplex as follows:

$$\beta_k \overset{\text{i.i.d.}}{\sim} \text{Beta}(1, \alpha), \qquad \pi_k := \beta_k \prod_{m < k} (1 - \beta_m), \qquad (7.4)$$

where $\alpha > 0$ controls the average length of pieces broken from the stick, and may take different values $\alpha_u$ and $\alpha_v$ for rows and columns, respectively. Such a prior allows for an infinite number of mixture components or clusters, and lets the data decide the number of effective components having positive probability masses, thereby increasing modeling flexibility. The process then samples row-cluster and column-cluster indicator variables $u_i$'s and $v_j$'s from mixture proportions $\pi_u$ and $\pi_v$, and for the $k$-th row-cluster and the $l$-th column-cluster draws an inverse-scale, or rate parameter $\lambda_{kl}$ from a Gamma distribution with shape parameter $h$ and scale parameter $c$. Finally, the generative process draws each matrix entry $A_{ij}$ from a zero-mean Laplace distribution with inverse scale $\lambda_{u_i v_j}$, such that entries belonging to the same bi-cluster share the same inverse scale, and hence represent interactions of similar magnitudes, whether positive or negative.
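The generative process above can be sketched in a few lines. This is an illustration only: the infinite stick-breaking construction is truncated at a finite number of clusters (an assumption made here for simulation, not part of the prior itself), and all function names are hypothetical.

```python
import random


def stick_breaking(alpha, truncation, rng):
    """Truncated stick-breaking weights; the last piece takes the remainder."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        beta = rng.betavariate(1.0, alpha)
        weights.append(remaining * beta)
        remaining *= 1.0 - beta
    weights.append(remaining)
    return weights


def sample_laplace(rate, rng):
    """Zero-mean Laplace draw with inverse scale `rate`.

    The difference of two independent Exponential(rate) draws is Laplace(0, 1/rate).
    """
    return rng.expovariate(rate) - rng.expovariate(rate)


def sample_biclustered_prior(p, alpha_u, alpha_v, h, c, truncation, rng):
    """Draw (A, u, v, lam) from the bi-clustered prior with truncated sticks."""
    pi_u = stick_breaking(alpha_u, truncation, rng)
    pi_v = stick_breaking(alpha_v, truncation, rng)
    u = rng.choices(range(truncation), weights=pi_u, k=p)
    v = rng.choices(range(truncation), weights=pi_v, k=p)
    # gammavariate takes (shape, scale), matching Gamma(h, c) in the text.
    lam = [[rng.gammavariate(h, c) for _ in range(truncation)]
           for _ in range(truncation)]
    A = [[sample_laplace(lam[u[i]][v[j]], rng) for j in range(p)]
         for i in range(p)]
    return A, u, v, lam
```

Entries of `A` that share the same pair `(u[i], v[j])` are drawn with the same inverse scale, which is what produces blocks of similar magnitude.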

Algorithm 7.1 Blocked Gibbs Sampler

Input: Data $X$ and $Y$, hyper-parameters $h, c, \alpha_u, \alpha_v$, and initial values $u^{(0)}, v^{(0)}, (\sigma^{(0)})^2$
Output: Samples from the full joint posterior $p(A, L, u, v, \sigma^2 \mid X, Y)$
Set iteration $t = 1$
repeat
    for $i = 1$ to $p$ do
        $A_i^{(t)} \sim p(A_i \mid A_{1:(i-1)}^{(t)}, A_{(i+1):p}^{(t-1)}, u^{(t-1)}, v^{(t-1)}, (\sigma^{(t-1)})^2, L^{(t-1)}, X, Y)$
    end for
    for $i = 1$ to $p$ do
        $u_i^{(t)} \sim p(u_i \mid A^{(t)}, u_{1:(i-1)}^{(t)}, u_{(i+1):p}^{(t-1)}, v^{(t-1)}, (\sigma^{(t-1)})^2, L^{(t-1)}, X, Y)$
    end for
    for $j = 1$ to $p$ do
        $v_j^{(t)} \sim p(v_j \mid A^{(t)}, u^{(t)}, v_{1:(j-1)}^{(t)}, v_{(j+1):p}^{(t-1)}, (\sigma^{(t-1)})^2, L^{(t-1)}, X, Y)$
    end for
    $(\sigma^{(t)})^2 \sim p(\sigma^2 \mid A^{(t)}, u^{(t)}, v^{(t)}, L^{(t-1)}, X, Y)$
    $L^{(t)} \sim p(L \mid A^{(t)}, u^{(t)}, v^{(t)}, (\sigma^{(t)})^2, X, Y)$
    Increase iteration $t$
until convergence

Notations: superscript $(t)$ denotes iteration, $A_i$ denotes the $i$-th row of $A$, $A_{i:j}$ denotes the sub-matrix in $A$ from the $i$-th until the $j$-th row, and $u_{i:j}$ denotes $\{u_n\}_{i \le n \le j}$.

The above bi-clustered prior subsumes a few interesting special cases. In some applications researchers may believe the clusters should be symmetric about rows and columns, which corresponds to enforcing $u = v$. If they further believe that within-cluster interactions should be stronger than between-cluster ones, they may adjust accordingly the hyper-parameters in the Gamma prior (7.2), or as in the group sparse prior proposed by Marlin et al. [2009] for Gaussian precision estimation, simply require all within-cluster matrix entries to have the same inverse scale constrained to be smaller than the one shared by all between-cluster entries. Our inference scheme detailed in the next section can be easily adapted to all these special cases.

There  can  be  interesting  generalizations  as  well.  For  example,  depending  on  the  application 
of  interest,  it  may  be  desirable  to  distinguish  positive  interactions  from  negative  ones,  so  that 
a bi-cluster of transition matrix entries possesses not only similar strengths, but also consistent
signs.  However,  such  a  generalization  requires  a  more  delicate  per-entry  prior  and  therefore  a 
more  complex  sampling  scheme,  which  we  leave  as  an  interesting  direction  for  future  work. 

7.3  Posterior  inference 

Let $L$ denote the collection of $\lambda_{kl}$'s, and let $u$ and $v$ denote $\{u_i\}_{1 \le i \le p}$ and $\{v_j\}_{1 \le j \le p}$, respectively. Given one or more time series, collectively denoted as matrices $X$ and $Y$ whose rows represent successive pairs of observations, i.e.,

$$Y = XA + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I),$$

we aim to carry out posterior inferences about the transition matrix $A$, and row and column cluster indicators $u$ and $v$. To do so, we consider sampling from the full joint posterior $p(A, L, u, v, \sigma^2 \mid X, Y)$, and develop an efficient blocked Gibbs sampler outlined in Algorithm 7.1. Starting with some reasonable initial configuration, the algorithm iteratively samples rows of $A$, row- and column-cluster indicator variables $u$ and $v$, the noise variance$^2$ $\sigma^2$, and the inverse scale parameters $L$ from their respective conditional distributions. Next we describe in more detail sampling from those conditional distributions.


7.3.1  Sampling  the  transition  matrix  A 

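Let $A_{-i}$ denote the sub-matrix of $A$ excluding the $i$-th row, and let $X_i$ and $X_{-i}$ denote the $i$-th column of $X$ and the sub-matrix of $X$ excluding the $i$-th column. Algorithm 7.1 requires sampling from the following conditional distribution:

$$p(A_i \mid A_{-i}, u, v, \sigma^2, L, X, Y) \propto \prod_{1 \le j \le p} \mathcal{N}(A_{ij} \mid \mu_{ij}, \sigma_i^2)\, \text{Laplace}(A_{ij} \mid 0, 1/\lambda_{u_i v_j}),$$

where $\mu_{ij}$ is the $j$-th entry of $\mu_i$ and

$$\mu_i := \frac{X_i^\top (Y - X_{-i} A_{-i})}{\|X_i\|_2^2}, \qquad \sigma_i^2 := \frac{\sigma^2}{\|X_i\|_2^2}.$$

Therefore, all we need is sampling from univariate densities of the form:

$$f(x) \propto \mathcal{N}(x \mid \mu, \sigma^2)\, \text{Laplace}(x \mid 0, 1/\lambda), \qquad (7.5)$$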


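whose c.d.f. $F(x)$ can be expressed in terms of the standard normal c.d.f. $\Phi(\cdot)$:

$$F(x) = \frac{C_1}{C}\, \Phi\!\left(\frac{x_- - (\mu + \sigma^2 \lambda)}{\sigma}\right) + \frac{C_2}{C} \left( \Phi\!\left(\frac{x_+ - (\mu - \sigma^2 \lambda)}{\sigma}\right) - \Phi\!\left(-\frac{\mu - \sigma^2 \lambda}{\sigma}\right) \right),$$

where $x_- := \min(x, 0)$, $x_+ := \max(x, 0)$, and

$$C := C_1\, \Phi\!\left(-\frac{\mu + \sigma^2 \lambda}{\sigma}\right) + C_2 \left(1 - \Phi\!\left(-\frac{\mu - \sigma^2 \lambda}{\sigma}\right)\right),$$

$$C_1 := \frac{\lambda}{2} \exp\!\left(\frac{\lambda (2\mu + \lambda \sigma^2)}{2}\right), \qquad C_2 := \frac{\lambda}{2} \exp\!\left(\frac{\lambda (\lambda \sigma^2 - 2\mu)}{2}\right).$$

We then sample from $f(x)$ with the inverse c.d.f. method. To reduce the potential sampling bias introduced by a fixed sampling schedule, we follow a random ordering of the rows of $A$ in each iteration.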
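As an illustration of the inverse c.d.f. step for densities of the form (7.5), the sketch below completes the square on each side of zero, which gives a Gaussian piece centred at $\mu + \sigma^2\lambda$ on $x < 0$ and one centred at $\mu - \sigma^2\lambda$ on $x \ge 0$. The function names are hypothetical, and the code is not hardened against overflow for extreme parameter values.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def piece_weights(mu, sigma, lam):
    """Unnormalized masses of the x < 0 and x >= 0 pieces and their centres.

    The factor (lam / 2) * exp(sigma^2 * lam^2 / 2) is common to both pieces
    and is dropped because it cancels in the normalization.
    """
    mu_neg = mu + sigma ** 2 * lam
    mu_pos = mu - sigma ** 2 * lam
    w_neg = math.exp(mu * lam) * STD_NORMAL.cdf(-mu_neg / sigma)
    w_pos = math.exp(-mu * lam) * (1.0 - STD_NORMAL.cdf(-mu_pos / sigma))
    return w_neg, w_pos, mu_neg, mu_pos


def sample_normal_laplace(mu, sigma, lam, rng):
    """Draw x with density proportional to N(x | mu, sigma^2) * exp(-lam * |x|)."""
    w_neg, w_pos, mu_neg, mu_pos = piece_weights(mu, sigma, lam)
    target = rng.random() * (w_neg + w_pos)  # uniform draw scaled by the total mass
    if target < w_neg:
        # Solve exp(mu * lam) * Phi((x - mu_neg) / sigma) = target for x < 0.
        q = target * math.exp(-mu * lam)
        centre = mu_neg
    else:
        # Solve for x >= 0 on the truncated Gaussian piece centred at mu_pos.
        q = STD_NORMAL.cdf(-mu_pos / sigma) + (target - w_neg) * math.exp(mu * lam)
        centre = mu_pos
    q = min(max(q, 1e-15), 1.0 - 1e-15)  # keep inv_cdf inside its open domain
    return centre + sigma * STD_NORMAL.inv_cdf(q)
```

In a Gibbs sweep this routine would be called once per matrix entry, with `mu`, `sigma`, and `lam` set from the current conditional.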

Algorithm  7.1  generates  samples  from  the  full  joint  posterior,  but  sometimes  it  is  desirable 
to  obtain  a  point  estimate  of  A.  One  simple  estimate  is  the  (empirical)  posterior  mean;  however, 
it  is  rarely  sparse.  To  get  a  sparse  estimate,  we  carry  out  the  following  “sample  EM”  step  after 
Algorithm  7.1  converges: 


$$\hat{A}_{\text{Biclus-EM}} := \arg\max_{A} \sum_{t} \log p(A \mid u^{(t)}, v^{(t)}, (\sigma^{(t)})^2, L^{(t)}, X, Y), \qquad (7.6)$$


where  t  starts  at  a  large  number  and  skips  some  fixed  number  of  iterations  to  give  better-mixed 
and  more  independent  samples.  The  optimization  problem  (7.6)  is  in  the  form  of  sparse  least 
square  regression,  which  we  solve  with  a  simple  coordinate  descent  algorithm. 

2Our  sampling  scheme  can  be  easily  modified  to  handle  diagonal  covariances. 
7.3.2 Sampling row and cluster indicators

Since our sampling procedures for $u$ and $v$ are symmetric, we only describe the one for $u$. It can be viewed as an instantiation of the general Gibbs sampling scheme studied by Meeds and Roweis [2007]. According to our model assumption, $u$ is independent of the data $X$, $Y$ and the noise variance $\sigma^2$ conditioned on all other random variables. Moreover, under the stick-breaking prior (7.4) over the row mixture proportions $\pi_u$ and some fixed $v$, we can view $u$ and the rows of $A$ as cluster indicators and samples drawn from a Dirichlet process mixture model with Gamma($h$, $c$) as the base distribution over cluster parameters. Finally, the Laplace distribution and the Gamma distribution form a conjugate pair, allowing us to integrate out the inverse scale parameters $L$ and derive the following “collapsed” sampling scheme:
where $\Gamma(\cdot)$ is the Gamma function, $\delta$ denotes the Kronecker delta function, $N_{-i}[k]$ is the size of the $k$-th row-cluster excluding $A_i$, $M[l]$ is the size of the $l$-th column-cluster, and

$$\|A_{-i}[k, l]\|_1 := \sum_{s \ne i,\, u_s = k,\, v_j = l} |A_{sj}|, \qquad \|A_i[l]\|_1 := \sum_{v_j = l} |A_{ij}|.$$

As in the previous section, we randomly permute the $u_i$'s and $v_j$'s in each iteration to reduce sampling bias, and also randomly choose to sample $u$ or $v$ first.

Just  as  with  the  transition  matrix  A,  we  may  want  to  obtain  point  estimates  of  the  cluster 
indicators.  The  usual  empirical  mean  estimator  does  not  work  here  because  the  cluster  labels 
may  change  over  iterations.  We  thus  employ  the  following  procedure: 

1. Construct a similarity matrix $S$ such that

$$S_{ij} := \frac{1}{T} \sum_{t} \mathbb{1}\{u_i^{(t)} = u_j^{(t)}\}, \quad 1 \le i, j \le p,$$

where $t$ selects iterations to approach mixing and independence as in (7.6), and $T$ is the total number of iterations selected.

2.  Run  normalized  spectral  clustering  [Ng  et  al.,  2001]  on  S,  with  the  number  of  clusters  set 
according  to  the  spectral  gap  of  S. 
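Step 1 of the procedure above can be sketched as follows, assuming the similarity is the fraction of retained samples in which two items share a cluster label. The function name is hypothetical; the spectral clustering of step 2 is not shown.

```python
def coclustering_similarity(label_samples):
    """Fraction of retained samples in which items i and j share a label.

    label_samples[t][i] is the cluster label of item i in retained sample t.
    Only equality of labels within a sample is used, so the result does not
    depend on how clusters happen to be numbered in each iteration.
    """
    T = len(label_samples)
    p = len(label_samples[0])
    S = [[0.0] * p for _ in range(p)]
    for labels in label_samples:
        for i in range(p):
            for j in range(p):
                if labels[i] == labels[j]:
                    S[i][j] += 1.0 / T
    return S
```

For example, the label vectors `[0, 0, 1]` and `[1, 1, 0]` describe the same partition and contribute identically to `S`, which is why this construction sidesteps the label-switching problem that rules out a plain average of the indicators.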
7.3.3 Sampling noise variance and inverse scale parameters

On the noise variance $\sigma^2$ we place an inverse-Gamma prior with shape $\alpha > 0$ and scale $\beta > 0$, leading to the following posterior:

$$\sigma^2 \mid A, X, Y \sim \text{I-Gamma}\!\left(\alpha + pT/2,\; \tfrac{1}{2}\|Y - XA\|_F^2 + \beta\right), \qquad (7.7)$$

where $T$ is the number of rows in $X$ and $\|\cdot\|_F$ denotes the matrix Frobenius norm. Due to the conjugacy mentioned in the last section, the inverse scale parameters $\lambda_{kl}$'s have the following posterior:

$$\lambda_{kl} \mid A, u, v \sim \text{Gamma}\!\left(N[k]M[l] + h,\; (\|A[k, l]\|_1 + 1/c)^{-1}\right).$$
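Both conjugate updates reduce to Gamma draws. The sketch below uses hypothetical function names and takes the residual norm and block entries as precomputed inputs; it relies on the fact that the reciprocal of a Gamma variable with shape $a$ and rate $b$ is inverse-Gamma with shape $a$ and scale $b$.

```python
import random


def sample_noise_variance(residual_sq_norm, n_entries, alpha, beta, rng):
    """Draw sigma^2 from I-Gamma(alpha + n_entries / 2, beta + residual_sq_norm / 2).

    residual_sq_norm is the squared Frobenius norm of Y - XA and
    n_entries is the number of entries of Y (p times T).
    """
    shape = alpha + 0.5 * n_entries
    scale = beta + 0.5 * residual_sq_norm
    # random.gammavariate takes (shape, scale); rate = scale of the I-Gamma.
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


def sample_inverse_scale(block_entries, h, c, rng):
    """Draw lambda_kl from Gamma(n + h, scale = 1 / (l1_norm + 1 / c)).

    block_entries holds the entries of A that fall in bi-cluster (k, l).
    """
    l1_norm = sum(abs(a) for a in block_entries)
    return rng.gammavariate(len(block_entries) + h, 1.0 / (l1_norm + 1.0 / c))
```

An empty bi-cluster simply returns a draw from the Gamma($h$, $c$) prior, since the block then contributes no entries and zero norm.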

7.4  Experiments 

We  conduct  both  simulations  and  experiments  on  a  real  gene  expression  time  series  dataset,  and 
compare  the  proposed  method  with  two  types  of  approaches: 

Learning  VAR  by  sparse  linear  regression,  followed  by  bi-clustering 

Unlike the proposed method, which makes inferences about the transition matrix $A$ and cluster indicators jointly, this natural baseline method first estimates the transition matrix by adaptive sparse or $L_1$ linear regression [Zou, 2006]:

$$\hat{A}^{L_1} := \arg\min_{A} \; \frac{1}{2}\|Y - XA\|_F^2 + \lambda \sum_{i,j} \frac{|A_{ij}|}{|\hat{A}^{ols}_{ij}|^{\gamma}}, \qquad (7.8)$$

where $\hat{A}^{ols}$ denotes the ordinary least-square estimator, and then bi-clusters $\hat{A}^{L_1}$ by either the cluster indicator sampling procedure in Section 7.3.2 or standard clustering methods applied to rows and columns separately. We compare the proposed method and this baseline in terms of predictive capability, clustering performance, and in the case of simulation study, model estimation error.

Clustering  based  on  time  series  similarity 

As described in Section 7.1, existing time series clustering methods are designed to group together time series that exhibit a similar behavior or dependency over time, whereas our proposed method clusters time series based on their (Granger) causal relations. We compare the proposed method with the time series clustering method proposed by Cooke et al. [2011], which models time series data by Gaussian processes and performs Bayesian Hierarchical Clustering [Heller and Ghahramani, 2005], achieving state-of-the-art clustering performances on the real genes time series data used in Section 7.4.

7.4.1  Simulation 

We  generate  a  transition  matrix  A  of  size  100  by  first  sampling  entries  in  bi-clusters: 

{Laplace(0,  41  <  i  <  70, 51  <j<  80, 

Laplace(0,  \Z70~1),  71  <  i  <  90, 1  <  j  <  50,  (7.9) 

Laplace(0,  ^TuT1),  91  <  i  <  100, 1  <  j  <  100, 
Figure 7.1: Heat maps of the synthetic bi-clustered VAR


Figure  7.2:  Prediction  errors  up  to  10  time  steps.  Errors  for  longer  horizons  are  close  to  those  by 
the  mean  (zero)  prediction,  shown  in  black  dashed  line,  and  are  not  reported. 


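and then all the remaining entries from a sparse background matrix:

$$A_{ij} = \begin{cases} B_{ij} & \text{if } |B_{ij}| \ge q_{98}\big(\{|B_{i'j'}|\}_{1 \le i', j' \le 100}\big), \\ 0 & \text{otherwise}, \end{cases} \qquad i, j \text{ not covered in (7.9)},$$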


where 

{B^} <ioo  im~  Laplace(0,  (Sa/^OO)-1) 

and $q_{98}(\cdot)$ denotes the 98-th percentile. Figure 7.1(a) shows the heat map of the actual $A$ we obtain by the above sampling scheme, showing clearly four row-clusters and three column-clusters. This transition matrix has the largest eigenvalue modulus of 0.9280, constituting a stable VAR model.

We then sample 10 independent time series of 50 time steps from the VAR model (7.1), with noise variance $\sigma^2 = 5$. We initialize each time series with an independent sample drawn from the stationary distribution of (7.1), whose correlation matrix is shown in Figure 7.1(b), suggesting that clustering based on correlations among time series may not recover the bi-cluster structure in Figure 7.1(a).

Table 7.1: Model estimation error on simulated data

             Normalized matrix error   Signed-support error
L1           0.3133±0.0003             0.3012±0.0008
Biclus EM    0.2419±0.0003             0.0662±0.0012

To compare the proposed method with the two baselines described in the beginning of Section 7.4, we repeat the following experiment 20 times: a random subset of two time series is treated as testing data, while the other eight time series are used as training data. For $L_1$ linear regression (7.8) we randomly hold out two time series from the training data as a validation set for choosing the best regularization parameter $\lambda$ from $\{2^{-2}, 2^{-1}, \ldots, 2^{10}\}$ and weight-adaption parameter $\gamma$ from $\{0, 2^{-2}, 2^{-1}, \ldots, 2^{2}\}$, with which the final $\hat{A}^{L_1}$ is estimated from all the training data. To bi-cluster $\hat{A}^{L_1}$, we consider the following:

• L1+Biclus: run the sampling procedure in Section 7.3.2 on $\hat{A}^{L_1}$.

• Refit+Biclus: refit the non-zero entries of $\hat{A}^{L_1}$ using least squares, and run the sampling procedure in Section 7.3.2.

• L1 row-clus (col-clus): construct similarity matrices

$$S^u_{ij} := \sum_{1 \le s \le p} |\hat{A}^{L_1}_{is}|\,|\hat{A}^{L_1}_{js}|, \qquad S^v_{ij} := \sum_{1 \le s \le p} |\hat{A}^{L_1}_{si}|\,|\hat{A}^{L_1}_{sj}|, \qquad 1 \le i, j \le p.$$

Then run normalized spectral clustering [Ng et al., 2001] on $S^u$ and $S^v$, with the number of clusters set to 4 for rows and 3 for columns, respectively.
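The row and column similarity matrices used by the last baseline can be sketched as below. This assumes the similarities are sums of products of absolute entries, as read from the definition above; the function name and the toy matrix are hypothetical, and the spectral clustering step itself is not shown.

```python
def row_col_similarity(a_hat):
    """Return (S_u, S_v) for a square matrix given as a list of rows.

    S_u[i][j] sums |a_hat[i][s]| * |a_hat[j][s]| over s (rows i and j compared);
    S_v[i][j] sums |a_hat[s][i]| * |a_hat[s][j]| over s (columns i and j compared).
    """
    p = len(a_hat)
    mag = [[abs(x) for x in row] for row in a_hat]
    s_u = [[sum(mag[i][s] * mag[j][s] for s in range(p)) for j in range(p)]
           for i in range(p)]
    s_v = [[sum(mag[s][i] * mag[s][j] for s in range(p)) for j in range(p)]
           for i in range(p)]
    return s_u, s_v


a_hat = [[1.0, -2.0, 0.0],
         [0.5, 0.0, 0.0],
         [0.0, 3.0, -1.0]]
s_u, s_v = row_col_similarity(a_hat)
```

Two rows are similar when they have large entries in the same columns, regardless of sign, and likewise for columns.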

For the second baseline, Bayesian Hierarchical Clustering and Gaussian processes (GPs), we use the R package BHC (version 1.8.0) with the squared-exponential covariance for Gaussian processes, as suggested by the author of the package. Following Cooke et al. [2011] we normalize each time series to have mean 0 and standard deviation 1. The package can be configured to use replicate information (multiple series) or not, and we experiment with both settings, abbreviated as BHC-SE reps and BHC-SE, respectively. In both settings we give the BHC package the mean of the eight training series as input, but additionally supply BHC-SE reps a noise variance estimated from multiple training series to aid GP modeling.

In our proposed method, several hyper-parameters need to be specified. For the stick-breaking parameters $\alpha_u$ and $\alpha_v$, we find that values in a reasonable range often lead to similar posterior inferences, and simply set both to be 1.5. We set the noise variance prior parameters in (7.7) to be $\alpha = 9$ and $\beta = 10$. For the two parameters in the Gamma prior (7.2), we set $h = 2$ and $c = \sqrt{2p} = \sqrt{200}$ to bias the transition matrices sampled from the Laplace prior (7.3) towards being stable. Another set of inputs to Algorithm 7.1 are the initial values, which we set as follows: $A^{(0)} = 0$, $\ldots = 1$, $(\sigma^{(0)})^2 = 1$, and $L_{ij}^{(0)} = (h - 1)c = \sqrt{200}$. We run Algorithm 7.1 and the sampling procedures for L1+Biclus and Refit+Biclus for 2,500 iterations, and take samples in every 10 iterations starting from the 1,501-st iteration, at which the sampling algorithms have mixed quite well, to compute point estimates for A, u and v as described in Sections 7.3.1 and 7.3.2.

Figure 7.3: Adjusted Rand index on simulated data
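As a small worked check of the sampling schedule above, the retained iterations can be enumerated directly. The helper name is hypothetical and iterations are assumed to be numbered from 1.

```python
def thinned_iterations(first_kept, last_iteration, step):
    """1-based iteration indices retained when keeping every `step`-th sample
    from `first_kept` through `last_iteration` inclusive."""
    return list(range(first_kept, last_iteration + 1, step))


simulation_kept = thinned_iterations(1501, 2500, 10)
```

With 2,500 iterations, a start at iteration 1,501 and a spacing of 10, this schedule retains 100 samples for the point estimates.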

Figure 7.2 shows the squared prediction errors of L1 linear regression (L1) and the proposed method with a final sample EM step (Biclus EM) for various prediction horizons up to 10. Prediction errors for longer horizons are close to those by predicting the mean of the series, which is zero under our stable VAR model, and are not reported here. Biclus EM slightly outperforms L1, and paired t-tests show that the improvements for all 10 horizons are significant at a p-value < 0.01. This suggests that when the underlying VAR model does have a bi-clustering structure, our proposed method can improve the prediction performance over adaptive L1 regression, though by a small margin.

Another way to compare L1 and Biclus EM is through model estimation error, and we report in Table 7.1 these two types of error:

Normalized matrix error: $\|\hat{A} - A\|_F / \|A\|_F$,

Signed-support error: $\frac{1}{p^2} \sum_{1 \le i,j \le p} I\big(\mathrm{sign}(\hat{A}_{ij}) \ne \mathrm{sign}(A_{ij})\big)$.

Clearly, Biclus EM performs much better than L1 in recovering the underlying model, and in particular achieves a large gain in signed-support error, thanks to its use of bi-clustered inverse scale parameters L.
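The two error measures can be sketched in a few lines. This is an illustration under assumptions rather than the evaluation code of the thesis: the matrix norm is taken to be the Frobenius norm, the signed-support error is averaged over all p² entries, and the function names and toy matrices are hypothetical.

```python
import math


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def normalized_matrix_error(a_hat, a):
    """Frobenius norm of (a_hat - a) divided by the Frobenius norm of a."""
    diff = sum((x - y) ** 2 for r_hat, r in zip(a_hat, a) for x, y in zip(r_hat, r))
    ref = sum(y ** 2 for r in a for y in r)
    return math.sqrt(diff / ref)


def signed_support_error(a_hat, a):
    """Fraction of entries whose sign (zero counted as its own sign) differs."""
    pairs = [(x, y) for r_hat, r in zip(a_hat, a) for x, y in zip(r_hat, r)]
    return sum(sign(x) != sign(y) for x, y in pairs) / len(pairs)


a_true = [[0.5, 0.0], [-0.2, 0.1]]
a_est = [[0.4, 0.1], [0.0, 0.1]]
matrix_error = normalized_matrix_error(a_est, a_true)
support_error = signed_support_error(a_est, a_true)
```

In the toy example the estimate adds one spurious entry and misses one true entry, so half of the four signs disagree even though the matrix error is moderate, which mirrors how the two measures can diverge in Table 7.1.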

Perhaps the most interesting is the clustering quality, which we evaluate by the Adjusted Rand Index [Hubert and Arabie, 1985], a common measure of similarity between two clusterings based on co-occurrences of object pairs across clusterings, with correction for chance effects. An adjusted Rand index takes the maximum value of 1 only when the two clusterings are identical (modulo label permutation), and is close to 0 when the agreement between the two clusterings could have resulted from two random clusterings. Figure 7.3 shows the clustering performances of different methods. The proposed method, labeled as Biclus, outperforms all alternatives greatly and always recovers the correct row and column clusterings. The two-stage baseline methods L1+Biclus, Refit+Biclus, and L1 row-clus (col-clus) make a significant number of errors, but still recover moderately accurate clusterings. In contrast, the clusterings by the time-series similarity based methods, BHC-SE and BHC-SE reps, are barely better than random clusterings. To explain this, we first point out that BHC-SE and BHC-SE reps are designed to model time series as noisy observations of deterministic, time-dependent “trends” or “curves” and to group similar curves together, but the time series generated from our stable VAR model all have zero expectation at all time points (not just across time). As a result, clustering based on similar trends may just be fitting noise in our simulated series. These results on clustering quality suggest that when the underlying cluster structure stems from (Granger) causal relations, clustering methods based on series similarity may give irrelevant results, and we really need methods that explicitly take into account dynamic interaction patterns, such as the one we propose here.

Figure 7.4: Heat maps of the Biclus-EM estimate of A and the inverse scale parameters L averaged over posterior samples; rows and columns permuted according to clusters.
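For readers unfamiliar with the Adjusted Rand Index used in this comparison, the following sketch computes it from pair counts in the contingency table of two labelings. It follows the standard Hubert and Arabie form; the function name and the handling of the degenerate case are choices made here, not taken from the thesis.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same objects."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Degenerate case (e.g. both labelings put everything in one cluster).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


same_up_to_labels = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call compares two labelings that differ only by a label permutation and so attains the maximum, while the second compares maximally crossed labelings and falls below zero.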

7.4.2  Modeling  T-cell  activation  gene  expression  time  series 

We analyze a gene expression time series dataset³ collected by Rangel et al. [2004] from a T-cell activation experiment. To facilitate the analysis, they pre-processed the raw data to obtain 44 replicates of 58 gene time series across 10 unevenly-spaced time points. Recently Cooke et al. [2011] carried out clustering analysis of these time series data, with their proposed Gaussian process (GP) based Bayesian Hierarchical Clustering (BHC) and quite a few other state-of-the-art time series clustering methods. BHC, aided by GP with a cubic spline covariance function, gave the best clustering result as measured by the Biological Homogeneity Index (BHI) [Datta and Datta, 2006], which scores a gene cluster based on its number of gene pairs that share certain biological annotations (Gene Ontology terms).

To apply our proposed method, we first normalize each time series to have mean 0 and standard deviation 1 across both time points and replicates, and then “de-trend” the series by taking the first order difference, resulting in 44 replicates of 58 time series of gene expression differences across 9 time points. We run Algorithm 7.1 on this de-trended dataset, with all the hyper-parameters and initial values set in the same way as in our simulation study. In 3,000 iterations the algorithm mixes reasonably well; we let it run for another 2,000 iterations and take samples from every 10 iterations, resulting in 200 posterior samples, to compute point estimates for A, cluster indicators u and v as described in Sections 7.3.1 and 7.3.2. Figures 7.4(a) and 7.4(b) show the heat maps of the transition matrix point estimate and the inverse scale parameters $L_{ij}$'s averaged over the posterior samples, with rows and columns permuted according to clusters, revealing a quite clear bi-clustering structure.

³ Available in the R package longitudinal.

Figure 7.5: BHI. Green dots show BHIs of different methods; blue boxes are BHIs obtained by 200 random permutations of cluster labels by those methods; green boxes are BHIs computed on posterior cluster indicator samples from the proposed method. In parentheses are numbers of clusters given by different methods.
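The normalization and de-trending step for a single gene can be sketched as follows. This is a hedged illustration: the function name is hypothetical, the standard deviation is assumed to be the population version (dividing by the number of values), and the toy input stands in for the 44 replicates of 10 time points.

```python
import math


def normalize_and_difference(replicates):
    """Standardize one gene's series jointly over replicates and time points,
    then take first-order differences within each replicate.

    `replicates` is a list of equal-length lists, one per replicate.
    """
    values = [x for series in replicates for x in series]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    scaled = [[(x - mean) / std for x in series] for series in replicates]
    return [[b - a for a, b in zip(series, series[1:])] for series in scaled]


toy_gene = [[1.0, 2.0, 4.0], [3.0, 2.0, 0.0]]  # 2 replicates x 3 time points
differences = normalize_and_difference(toy_gene)
```

Differencing shortens each series by one point, which is why 10 time points become 9 in the de-trended dataset.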

For competing methods, we use the GP based Bayesian Hierarchical Clustering (BHC) by Cooke et al. [2011], with two GP covariance functions: cubic spline (BHC-C) and squared-exponential (BHC-SE)⁴. We also apply the two-stage method L1+Biclus described in our simulation study, but its posterior samples give an average of 15 clusters, which is much more than the number of clusters, around 4, from the spectral analysis described in Section 7.3.2, suggesting a high level of uncertainty in its posterior inferences about cluster indicators. We thus do not report its results here. The other two simple baselines are: Corr, standing for normalized spectral clustering on the correlation matrix of the 58 time series averaged over all 44 replicates, with the number of clusters, 2, determined by the spectral gap; and All-in-one, which simply puts all genes in one cluster.

Figure 7.5 shows the BHI scores⁵ given by different methods, and higher values indicate better clusterings. Biclus row and Biclus col respectively denote the row and column clusterings given by our method. To measure the significance of the clusterings, we report BHI scores computed on 200 random permutations of the cluster labels given by each method. For Biclus row and Biclus col, we also report the scores computed on the 200 posterior samples. All-in-one has a BHI score around 0.63, suggesting that nearly two-thirds of all gene pairs share some biological annotations. Corr puts genes into two nearly equal-sized clusters (28 and 30), but does not increase the BHI score much. In contrast, BHC-C and Biclus row achieve substantially higher scores, and both are significantly better than those by random permutations, showing that the improvements are much more likely due to the methods rather than varying numbers or sizes of clusters. We also note that even though Corr and BHC-C both give two clusters, the two BHC-C clusters have very different sizes (48 and 10), which causes a larger variance in their BHI distribution under random label permutations. Lastly, BHC-SE and Biclus col give lower scores that are not significantly better than random permutations. One possible explanation for the difference in scores by Biclus row and Biclus col is that the former bases itself on how genes affect one another while the latter on how genes are affected by others, and Gene Ontology terms, the biological annotations underlying the BHI function, describe more about genes’ active roles or molecular functions in various biological processes than what influences genes.

⁴ We did not report results obtained using replicate information because they are not better. Cluster labels are from http://www.biomedcentral.com/1471-2105/12/399/additional.

⁵ We compute BHIs by the BHI function in the R package clValid (version 0.6-4) [Brock et al., 2008] and the database hgu133plus2.db (version 2.6.3), following Cooke et al. [2011].

Figure 7.6: Gene functional profiling of the large BHC-C cluster

Figure 7.7: Gene functional profiling of two large row clusters by the proposed method

Finally, to gain more understanding of the clusters by BHC-C and Biclus row, we conduct gene function profiling with the web-based tool g:Profiler [Reimand et al., 2011], which performs “statistical enrichment analysis to provide interpretation to user-defined gene lists.” We select the following three options: Significant only, Hierarchical sorting, and No electronic GO annotations. For BHC-C, 4 out of 10 genes in the small cluster are found to be associated with the KEGG cell-cycle pathway (04110), but the other 6 genes are not mapped to collectively meaningful annotations. The profiling results of the large BHC-C cluster with 48 genes are in Figure 7.6; for better visibility we show only the Gene Ontology (GO) terms and highlight similar terms with red rectangles and tags. About half of the terms are related to cell death and immune response, and the other half are lower-level descriptions involving, for example, signaling pathways. For Biclus row, we report the profiling results of only the two larger clusters (the second and the third) in Figure 7.7, because the two smaller clusters, each containing 5 genes, are not mapped to collectively meaningful GO terms. Interestingly, the two large Biclus row clusters are associated with T-cell activation and immune response respectively, and together they cover 41 of the 48 genes in the large BHC-C cluster. This suggests that our method roughly splits the large BHC-C cluster into two smaller ones, each being mapped to a more focused set of biological annotations. Moreover, these Biclus profiling results, the heat map (Figure 7.4(a)), and the contingency table between the row and column clusters (Table 7.2) together resonate with the fact that T-cell activation results from, rather than leads to, the emergence of immune responses.

Table 7.2: Contingency table of row and column clusterings


7.5  Conclusion 

We develop a nonparametric Bayesian method to simultaneously infer sparse VAR models and bi-clusterings from multivariate time series data, and demonstrate its effectiveness via simulations and experiments on real T-cell activation gene expression time series, on which the proposed method finds a more biologically interpretable clustering than those by some state-of-the-art methods. Future directions include modeling signs of transition matrix entries, generalizations to higher-order VAR models, and applications to other real time series.




Chapter  8 

Conclusions  and  Future  Directions 


Motivated by the difficulties in collecting reliable time series data in a variety of modern dynamic modeling tasks, we study in this thesis the problem of learning dynamic models from data that lack time information but are easier to obtain. We observe that such non-sequence data can often be modeled as independent samples drawn from multiple, independent executions of the underlying dynamic process. Based on this assumption, we propose and study learning algorithms for several widely-used dynamic models, including fully observable linear and non-linear models, and Hidden Markov Models.

For fully observable models, we first point out some model identifiability issues in learning from non-sequence data. Then we develop several EM-type learning algorithms based on maximizing approximate likelihood, and for the case where a small amount of sequence data are available, we propose a novel penalized least squares approach that uses both sequence and non-sequence data. Empirical evaluation on synthetic data and several real data sets, including gene expression and cell image time series, demonstrates that our proposed methods can learn reasonably accurate dynamic models with little or even no time information. However, we also observe several failure modes that are hard to overcome without extra information or assumptions. This suggests that for the proposed methods to make an impact in real applications, they should incorporate as much expert domain knowledge as possible. For example, knowing how the variables in the dynamic model might interact with one another can help the design of a better regularization scheme. This motivates us to develop methods for learning bi-clustered vector autoregressive models. Or, in some applications there might be partial ordering information about the data, which can provide constraints in our EM-type algorithms.

For Hidden Markov Models, we build on recent advances in spectral learning of latent variable models and propose tensor factorization based methods that guarantee consistent parameter estimation, under reasonable assumptions on the underlying HMM and the generative process of non-sequence data. These assumptions are inspired by spectral learning of topic models, but have a few key differences, such as conditions on the Dirichlet prior for the initial state distribution and modeling missing times as geometric random variables, that are specific to the HMM setting. Although these generative assumptions may not hold in observational data, they may be fairly easy to implement in some scientific experiments. We also consider the situation when little sequence data are available, and propose a spectral algorithm using both types of data, which outperforms sequence-only learning algorithms.

Going forward, one interesting direction is to investigate whether spectral methods can be used to consistently learn first-order observable models from non-sequence data, and under what conditions. As demonstrated in Chapter 5, it is primarily the discreteness, or more generally, non-Gaussianity of the hidden state space dynamics that leads to nice tensor structures in observable moments and easy characterization of assumptions ensuring unique parameter estimation. Therefore, in the case of first-order models with continuous observations, we expect that a non-Gaussian initial distribution is needed for consistent spectral learning from non-sequence data. Moreover, it is likely that extra assumptions on the initial distribution, such as distinct variances or means in different dimensions, are required to eliminate the invariance to parameter permutation inherent in spectral learning.

Another important future direction is to make an impact in real applications with our proposed methods. In order for that to happen, we expect to see various interesting extensions or modifications to our approaches that are tailored to the application of interest. In particular, our proposed modeling assumption of non-sequence data has several components that can be replaced to better suit different applications, such as the distributional assumption on the missing times and the observational noise model. More broadly, our work has demonstrated the possibility of using cross sectional data to aid longitudinal study. As mentioned in the very beginning of the thesis, it is common in medical and social sciences that cross sectional data are much easier to collect than longitudinal data, and yet a lot of cross sectional data were actually collected under some longitudinal effect. With advances in large-scale sensing technology, this situation will likely become more prevalent. We think there are several possibilities for our work to make concrete contributions. For example, at the initial stage of longitudinal studies, researchers often have to pose reasonable hypotheses to guide the design of experiments or data collection protocols. However, even forming good hypotheses may be difficult when the subject matter involves a complicated system. In this situation, our methods may serve as a good hypothesis generator, using cross sectional data that are available to produce possible models. Or, sometimes researchers may want some immediate, preliminary assessment even though the longitudinal study is still ongoing and has only produced limited data. If there are abundant cross sectional data in the same domain, our methods of combining sequence and non-sequence data may be used to provide a reasonable estimate of the dynamic model under study.

In conclusion, our work demonstrates the possibility of learning dynamic models from data that lack time information, and we hope it stimulates more research in making better use of the large amount of cross sectional data brought by modern sensing technology.

Appendix A

A Variational EM algorithm for Learning HMMs from Non-sequence Data

Based on the generative process in Section 5.2.2, we derive a variational EM algorithm for parameter learning assuming the observation noise follows a spherical Gaussian with variance $\sigma^2$. The full joint probability of data and latent variables takes the following form:

in which we use superscripts as set indices and subscripts as data indices within a set wherever appropriate. The goal is to maximize the marginal likelihood of the data w.r.t. the parameters. We begin by marginalizing over the latent times $\{t_i^j\}$:

where T denotes the expected transition probability matrix. As in the tensor factorization approach, we recover P and r from the estimated T using the proposed search heuristics. Because the posterior distribution of the remaining latent variables still leads to an intractable E step, we employ the following factorized approximation:

where $\psi(\cdot)$ is the digamma function. The variational EM algorithm then amounts to maximizing g iteratively, alternating between the following two steps until convergence:

Variational  E-step 

Holding  the  model  parameters  fixed,  repeat  the  updates 

($iV  °c  \Ui,  a2 ^TuiexptyiPl,) -$($)), 

m,  =  £«>,+<*,, 

i,l' 

until  convergence. 

M-step 

Holding the variational parameters $\{\phi\}$ and $\{\beta\}$ fixed, update model parameters:

The update for $\alpha$ is a convex optimization problem whose inverse Hessian can be computed in linear time [Blei et al., 2003].

Finally, we have to match the columns of U with the columns of T. Note that the updates imply that the columns of U are aligned with the rows of T, so it suffices to match T’s rows with its columns. Using the assumptions that $\alpha/\alpha_0 = \pi$ and $\pi_i \ne \pi_j\ \forall\, i \ne j$, we recover the matching by sorting $\alpha/\alpha_0$ and $T\alpha/\alpha_0$.
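The matching-by-sorting step can be sketched as follows. This is one reading of the procedure, stated as an assumption: the stationary vector is indexed like the columns of the estimated T, the product of T with that vector gives the same values indexed like the rows, and sorting both reveals which row corresponds to which column. The function name and the toy matrix are hypothetical.

```python
def match_rows_to_columns(t_est, pi_cols):
    """Return `match` with match[r] = column index paired with row r of t_est.

    t_est is a square matrix (list of rows) whose rows may be permuted relative
    to its columns; pi_cols is the stationary vector indexed like the columns.
    """
    k = len(pi_cols)
    pi_rows = [sum(t_est[r][c] * pi_cols[c] for c in range(k)) for r in range(k)]
    col_order = sorted(range(k), key=lambda c: pi_cols[c])
    row_order = sorted(range(k), key=lambda r: pi_rows[r])
    match = [None] * k
    for r, c in zip(row_order, col_order):
        match[r] = c
    return match


pi = [0.5, 0.3, 0.2]
# Column-stochastic toy matrix 0.5 * I + 0.5 * pi 1^T, which satisfies T pi = pi.
t_true = [[0.5 * (i == j) + 0.5 * pi[i] for j in range(3)] for i in range(3)]
perm = [2, 0, 1]
t_est = [t_true[perm[r]] for r in range(3)]  # rows shuffled, columns untouched
match = match_rows_to_columns(t_est, pi)
```

The distinctness assumption on the entries of π is what makes the two sorted orders line up unambiguously; with ties the matching would not be unique.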
Appendix B

Proofs of Theorems in Chapter 5

B.1 Tensor structure in low-order moments

Here we give proofs of theorems on tensor structures in low-order moments of observable data. The proofs make repeated use of the following facts:

• $T\pi = \pi$, i.e., the stationary state distribution is invariant under the expected transition probability matrix T.

• The missing time steps $t_i$'s are independent of everything else.

• Conditioned on the initial state distribution $\pi_0$, i.e., within the same set of data, the observations $\{x_i\}$ are mutually independent, so are the hidden states $\{h_i\}$ and the initial states $\{s_i\}$.

• The low-order moments of the Dirichlet distribution have a special form (cf. Appendix B.1 of Anandkumar et al. [2013]), which leads to the desired symmetric tensor structure.

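B.1.1 Proof of Theorem 2

$$\begin{aligned}
C_2 &= T \left( \frac{\mathrm{diag}(\pi)}{\alpha_0 + 1} + \frac{\alpha_0\, \pi \pi^\top}{\alpha_0 + 1} \right) T^\top && \text{(B.1)} \\
&= \frac{T\, \mathrm{diag}(\pi)\, T^\top}{\alpha_0 + 1} + \frac{\alpha_0\, \pi \pi^\top}{\alpha_0 + 1}. && \text{(B.2)}
\end{aligned}$$

$$\begin{aligned}
& E[x_1 \otimes x_2 \otimes x_3] \\
&= E_{\pi_0} E[(P^{t_1} s_1) \otimes (P^{t_2} s_2) \otimes (P^{t_3} s_3) \mid \pi_0] \\
&= E_{\pi_0}[(T\pi_0) \otimes (T\pi_0) \otimes (T\pi_0)] \\
&= \frac{\sum_i 2\pi_i\, T_i \otimes T_i \otimes T_i}{(\alpha_0 + 2)(\alpha_0 + 1)} + \frac{\alpha_0^2\, \pi \otimes \pi \otimes \pi}{(\alpha_0 + 2)(\alpha_0 + 1)} \\
&\quad + \frac{\alpha_0 \sum_{i,j} \left( T_i \otimes T_i \otimes T_j + T_i \otimes T_j \otimes T_i + T_j \otimes T_i \otimes T_i \right) \pi_i \pi_j}{(\alpha_0 + 2)(\alpha_0 + 1)} && \text{(B.3)} \\
&= \frac{\sum_i 2\pi_i\, T_i^{\otimes 3} - 2\alpha_0^2\, \pi^{\otimes 3}}{(\alpha_0 + 2)(\alpha_0 + 1)} + \frac{\alpha_0 \left( \pi \otimes_3 C_2 + \pi \otimes_2 C_2 + \pi \otimes_1 C_2 \right)}{\alpha_0 + 2}. && \text{(B.4)}
\end{aligned}$$

We obtain (B.1) and (B.3) by using the expressions of Dirichlet moments derived in Appendix B.1 of Anandkumar et al. [2013]. Re-arranging (B.2) and (B.4) leads to the adjusted moments $M_2$ and $M_3$.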


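B.1.2 Proof of Theorem 4

$$\begin{aligned}
V_1 &:= E[x_1] \\
&= E[U h_1 + \epsilon_1] \\
&= U E[P^{t_1} s_1] \\
&= U T E[\pi_0] \\
&= U \pi.
\end{aligned}$$

$$\begin{aligned}
V_2 &:= E[x_1 x_1^\top] \\
&= E[(U h_1 + \epsilon_1)(U h_1 + \epsilon_1)^\top] \\
&= E[U h_1 h_1^\top U^\top] + \sigma^2 I \\
&= U E[\mathrm{diag}(h_1)] U^\top + \sigma^2 I \\
&= U E[\mathrm{diag}(P^{t_1} s_1)] U^\top + \sigma^2 I \\
&= U E[\mathrm{diag}(T \pi_0)] U^\top + \sigma^2 I \\
&= U\, \mathrm{diag}(\pi)\, U^\top + \sigma^2 I.
\end{aligned}$$

$$\begin{aligned}
V_3 &:= E[x_1 \otimes x_1 \otimes x_1] \\
&= E[(U h_1 + \epsilon_1) \otimes (U h_1 + \epsilon_1) \otimes (U h_1 + \epsilon_1)] \\
&= E[(U h_1)^{\otimes 3}] + E[(U h_1) \otimes \epsilon_1 \otimes \epsilon_1] + E[\epsilon_1 \otimes (U h_1) \otimes \epsilon_1] + E[\epsilon_1 \otimes \epsilon_1 \otimes (U h_1)] \\
&= \sum_i \pi_i\, U_i \otimes U_i \otimes U_i + V_1 \otimes_1 (\sigma^2 I) + V_1 \otimes_2 (\sigma^2 I) + V_1 \otimes_3 (\sigma^2 I),
\end{aligned}$$

which relies on the assumption of zero skewness $E[(\epsilon_1)_d^3] = 0$, $1 \le d \le m$.

$$\begin{aligned}
C_2 &:= E[x_1 x_2^\top] \\
&= E[(U h_1 + \epsilon_1)(U h_2 + \epsilon_2)^\top] \\
&= E[U h_1 h_2^\top U^\top] && \text{(B.5)} \\
&= U E[P^{t_1} s_1 s_2^\top (P^{t_2})^\top] U^\top \\
&= U T E[\pi_0 \pi_0^\top] T^\top U^\top \\
&= \frac{U T\, \mathrm{diag}(\pi)\, (U T)^\top}{\alpha_0 + 1} + \frac{\alpha_0 V_1 V_1^\top}{\alpha_0 + 1}. && \text{(B.6)}
\end{aligned}$$

$$\begin{aligned}
C_3 &:= E[x_1 \otimes x_2 \otimes x_3] \\
&= E[(U h_1 + \epsilon_1) \otimes (U h_2 + \epsilon_2) \otimes (U h_3 + \epsilon_3)] \\
&= E[(U h_1) \otimes (U h_2) \otimes (U h_3)] && \text{(B.7)} \\
&= E[(U P^{t_1} s_1) \otimes (U P^{t_2} s_2) \otimes (U P^{t_3} s_3)] \\
&= E[(U T \pi_0) \otimes (U T \pi_0) \otimes (U T \pi_0)] \\
&= \frac{\sum_i 2\pi_i\, (U T)_i^{\otimes 3} - 2\alpha_0^2\, V_1^{\otimes 3}}{(\alpha_0 + 2)(\alpha_0 + 1)} + \frac{\alpha_0 \left( V_1 \otimes_3 C_2 + V_1 \otimes_2 C_2 + V_1 \otimes_1 C_2 \right)}{\alpha_0 + 2}. && \text{(B.8)}
\end{aligned}$$

Note that due to the independence assumption, there are no noise-related terms in (B.5) and (B.7), indicating that $C_2$ and $C_3$ are unaffected by the noise distribution. And again, (B.6) and (B.8) are established with the expressions of Dirichlet moments in Appendix B.1 of Anandkumar et al. [2013]. As in Appendix B.1.1, $M_2$, $M_3$, $M_2'$ and $M_3'$ result from adjusting the raw moments.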

B.2 Proof of Theorem 3

We first prove the following lemma:

Lemma 1. If $P(r) := (rI + (1 - r)T^*)^{-1} T^*$ exists and is a stochastic matrix for some $r \in (0, 1]$, then $P(r')$ exists and is a stochastic matrix for all $r' \in [r, 1]$.

Proof. Since $P(r)$ exists we can write $T^* = rP(r)(I - (1 - r)P(r))^{-1}$. By assumption $P^*$ is invertible, so $T^*$ is invertible. We then have

$$r'(T^*)^{-1} + (1 - r')I = \frac{r'}{r}\left(P(r)^{-1} - (1 - r)I\right) + (1 - r')I = \frac{r'}{r} P(r)^{-1}\left(I - (1 - r/r')P(r)\right),$$

which is invertible for all $r' \in [r, 1]$. Therefore, we can write

$$P(r') = \left(r'(T^*)^{-1} + (1 - r')I\right)^{-1} = \frac{r}{r'} P(r)\left(I - (1 - r/r')P(r)\right)^{-1} = E_t[P(r)^t],$$

where $t \sim \mathrm{Geometric}(r/r')$, showing that $P(r')$ is a stochastic matrix. □
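The final equality in the proof is a matrix geometric series, spelled out here as a sketch. It assumes the convention that $t \sim \mathrm{Geometric}(q)$ is supported on $\{1, 2, \ldots\}$ with $\Pr(t = n) = q(1 - q)^{n-1}$; the symbols $q$ and $P$ are introduced only for this illustration, with $q = r/r'$ and $P = P(r)$ in the lemma.

$$E_t[P^t] = \sum_{n \ge 1} q(1 - q)^{n-1} P^n = qP \sum_{n \ge 0} \left((1 - q)P\right)^n = qP\left(I - (1 - q)P\right)^{-1}.$$

The series converges for $q \in (0, 1]$ because a stochastic matrix has spectral radius 1, so $(1 - q)P$ has spectral radius $1 - q < 1$. Every power $P^n$ is stochastic and the weights $q(1 - q)^{n-1}$ sum to one, so the expectation is a convex combination of stochastic matrices and is itself stochastic.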

To prove Theorem 3 we begin by noting that $S$ contains all values of $r$ for which $rI + (1 - r)T^*$ is singular. Therefore, $P(r)$ is well-defined and invertible for $r \in (0, 1] \setminus S$. From the identity $T^*\pi^* = \pi^* = (rI + (1 - r)T^*)\pi^*$ we have $P(r)\pi^* = \pi^*$, $r \notin S$. Similarly, the identity $1^\top T^* = 1^\top = 1^\top(rI + (1 - r)T^*)$ and the fact that $(rI + (1 - r)T^*)^{-1}T^* = T^*(rI + (1 - r)T^*)^{-1}$ imply that $1^\top P(r) = 1^\top$, $r \notin S$. It is easy to verify $P(r^*) = P^*$ by plugging in the definition of $T^*$. Lemma 1 then implies that $\max(S) < r^*$ and that $P(r')$ is a stochastic matrix for $r' \ge r^*$. To prove the last statement of the theorem we rewrite $P(r)$ by plugging in the definition of $T^*$:

$$P(r) = \frac{r^*}{r}\left(I - (1 - r^*/r)P^*\right)^{-1} P^*$$

and consider its first-order derivative w.r.t. $r$:

$$\frac{dP(r)}{dr} = -\frac{r^*}{r^2}\left(I - (1 - r^*/r)P^*\right)^{-2}(I - P^*)P^*,$$

which exists for $r \in (0, 1] \setminus S$. By assumption we have $P^*_{ij} = 0$, and by ergodicity of $P^*$ we can assume $(P^*)^2_{ij} > 0$ (otherwise there exists $k \ne j$ such that $P^*_{ik} = 0$ and $(P^*)^2_{ik} > 0$). Then we have

$$\frac{dP(r)_{ij}}{dr}\bigg|_{r = r^*} = \frac{(P^*)^2_{ij}}{r^*} > 0,$$

implying that there exists $c > 0$ such that for $r \in [r^* - c, r^*)$, $P(r)_{ij} < P^*_{ij} = 0$. This and Lemma 1 then imply the last statement of the theorem.

B.3 Sample Complexity Analysis

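The analyses here mostly follow those in Anandkumar et al. [2013]. Let $O$ denote the observation matrix, which can be the $T$ matrix in first-order Markov models, the $U$ matrix or the product $UT$ in Hidden Markov Models. Define

$$\tilde{O} := O\, \mathrm{diag}(\sqrt{\pi_1}, \sqrt{\pi_2}, \cdots, \sqrt{\pi_k}),$$

$$M_2 := O\, \mathrm{diag}(\pi)\, O^\top = \tilde{O}\tilde{O}^\top \quad \text{and} \quad M_3 := \sum_{i=1}^{k} \pi_i\, O_i \otimes O_i \otimes O_i.$$

Let $\pi_{\min} := \min_i \pi_i$. We have

$$\sigma_k(O)\sqrt{\pi_{\min}} \le \sigma_k(\tilde{O}), \qquad \sigma_1(\tilde{O}) \le \sigma_1(O),$$

where $\sigma_j(\cdot)$ denotes the $j$th largest singular value.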
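The two singular value inequalities follow from a short variational argument, sketched here under the assumption that $O$ has $k$ columns and at least $k$ rows. The symbol $D := \mathrm{diag}(\sqrt{\pi_1}, \ldots, \sqrt{\pi_k})$ is introduced only for this illustration, so that the scaled observation matrix is $\tilde{O} = OD$.

$$\sigma_k(\tilde{O}) = \min_{\|x\|_2 = 1} \|O D x\|_2 \ge \sigma_k(O) \min_{\|x\|_2 = 1} \|D x\|_2 = \sigma_k(O)\sqrt{\pi_{\min}},$$

$$\sigma_1(\tilde{O}) = \max_{\|x\|_2 = 1} \|O D x\|_2 \le \sigma_1(O) \max_i \sqrt{\pi_i} \le \sigma_1(O).$$

The last step uses $\pi_i \le 1$, which holds because $\pi$ is a probability vector.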

Denote  by  ||  •  ||  the  spectral  norm  of  a  matrix  or  the  operator  norm  of  a  symmetric  third-order 
tensor  induced  by  the  vector  2-norm: 


Suppose 


||M||  :=  sup  \M(d.e,6)\. 
Il»l|2  =  l 


\\m2  —  M2\\  —  E2, 
||M3  —  M3||  <  e3, 


for  some  E2  and  E3  to  be  determined. 
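As a small illustration of these definitions, the sketch below builds $M_2$ and $M_3$ from a toy observation matrix and evaluates the trilinear form $M_3(\theta, \theta, \theta)$ that appears in the tensor norm. The matrix `O`, the weights `pi`, and the vector `theta` are made-up values, and `theta` is deliberately not normalized because the identity being checked holds for any vector.

```python
from fractions import Fraction as F

def second_moment(O, pi):
    """M2[a][b] = sum_i pi_i * O[a][i] * O[b][i], i.e. O diag(pi) O^T."""
    m, k = len(O), len(pi)
    return [[sum(pi[i] * O[a][i] * O[b][i] for i in range(k))
             for b in range(m)] for a in range(m)]

def third_moment(O, pi):
    """M3[a][b][c] = sum_i pi_i * O[a][i] * O[b][i] * O[c][i]."""
    m, k = len(O), len(pi)
    return [[[sum(pi[i] * O[a][i] * O[b][i] * O[c][i] for i in range(k))
              for c in range(m)] for b in range(m)] for a in range(m)]

def trilinear(M3, theta):
    """Evaluate the multilinear form M3(theta, theta, theta)."""
    m = len(theta)
    return sum(M3[a][b][c] * theta[a] * theta[b] * theta[c]
               for a in range(m) for b in range(m) for c in range(m))

# Hypothetical 3x2 observation matrix (columns O_1, O_2) and weights pi.
O = [[F(1), F(0)], [F(0), F(1)], [F(1, 2), F(1, 2)]]
pi = [F(1, 4), F(3, 4)]
theta = [F(1), F(-2), F(3)]

M2 = second_moment(O, pi)
M3 = third_moment(O, pi)
# Direct evaluation: sum_i pi_i * (O_i . theta)^3.
direct = sum(pi[i] * sum(O[a][i] * theta[a] for a in range(3)) ** 3
             for i in range(2))
```

Here `trilinear(M3, theta)` equals `direct`, which is the reason the tensor norm can be evaluated through inner products with the columns of $O$.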


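B.3.1 Perturbation Lemmas

Let $\hat{M}_{2,k}$ be the best rank-$k$ approximation to $\hat{M}_2$ in terms of the matrix 2-norm. According to Algorithm 5.1, we have

\[ \hat{W}^\top \hat{M}_{2,k} \hat{W} = I. \]

Let

\[ \hat{W}^\top M_2 \hat{W} = A D A^\top \]

be an SVD of $\hat{W}^\top M_2 \hat{W}$, where $A \in \mathbb{R}^{k \times k}$. Define

\[ W := \hat{W} A D^{-1/2} A^\top \]

and notice that

\[ W^\top M_2 W = A D^{-1/2} A^\top \hat{W}^\top M_2 \hat{W} A D^{-1/2} A^\top = I. \]

Let $Q := W^\top \tilde{O}$ and $\hat{Q} := \hat{W}^\top \tilde{O}$.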
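Lemma 2. (Lemma C.1 of Anandkumar et al. [2013]) Let $\Pi_W$ be the orthogonal projection onto the range of $W$ and $\Pi$ be the orthogonal projection onto the range of $O$. Suppose $E_2 \le \sigma_k(M_2)/2$. We have the following:

\[
\begin{aligned}
\|Q\| &= 1, \\
\|\hat{Q}\| &\le 2, \\
\|\hat{W}\| &\le \frac{2}{\sigma_k(\tilde{O})}, \\
\|\hat{W}^\dagger\| &\le 2\sigma_1(\tilde{O}), \\
\|W^\dagger\| &\le 3\sigma_1(\tilde{O}), \\
\|\hat{Q} - Q\| &\le \frac{4E_2}{\sigma_k(\tilde{O})^2}, \\
\|\hat{W}^\dagger - W^\dagger\| &\le \frac{6\sigma_1(\tilde{O})E_2}{\sigma_k(\tilde{O})^2}, \\
\|\Pi - \Pi_W\| &\le \frac{4E_2}{\sigma_k(\tilde{O})^2}.
\end{aligned}
\]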


Lemma 3. Weyl's Theorem (Theorem 4.11, p. 204 in Stewart and Sun [1990]). Let $A, E \in \mathbb{R}^{m \times n}$ with $m \ge n$ be given. Then

\[ \max_{1 \le i \le n} |\sigma_i(A + E) - \sigma_i(A)| \le \|E\|. \]
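Weyl's bound can be tried out numerically in the smallest non-trivial case. The sketch below handles only $2 \times 2$ matrices, where singular values have a closed form through the eigenvalues of $A^\top A$; the matrices `A` and `E` and the helper names are illustrative assumptions, not part of the analysis.

```python
import math

def singular_values_2x2(A):
    """Singular values of a 2x2 matrix, largest first, via eig(A^T A)."""
    a = A[0][0] ** 2 + A[1][0] ** 2          # (A^T A)[0][0]
    d = A[0][1] ** 2 + A[1][1] ** 2          # (A^T A)[1][1]
    b = A[0][0] * A[0][1] + A[1][0] * A[1][1]  # off-diagonal entry
    mean = (a + d) / 2
    rad = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return (math.sqrt(mean + rad), math.sqrt(max(mean - rad, 0.0)))

def spectral_norm(E):
    """Spectral norm = largest singular value."""
    return singular_values_2x2(E)[0]

def weyl_gap(A, E):
    """max_i |sigma_i(A + E) - sigma_i(A)|."""
    AE = [[A[i][j] + E[i][j] for j in range(2)] for i in range(2)]
    return max(abs(s - t) for s, t in
               zip(singular_values_2x2(AE), singular_values_2x2(A)))

# Hypothetical matrix and perturbation.
A = [[3.0, 0.0], [0.0, 1.0]]
E = [[0.1, 0.2], [-0.1, 0.05]]
gap, bound = weyl_gap(A, E), spectral_norm(E)
```

For any choice of `A` and `E`, `gap` should not exceed `bound` up to floating-point rounding, which is how Lemma 3 is used later to control $\sigma_k$ of perturbed matrices.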


B.3.2  Reconstruction  Accuracy 

Throughout  this  section  we  assume  that  the  number  of  iterations  N  and  L  for  Algorithm  5.2 
satisfy  the  conditions  in  Theorem  1. 

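Lemma 4. Suppose $\max(E_2, E_3) \le \sigma_k(M_2)/2$. For any $\eta \in (0,1)$, with probability at least $1 - \eta$ the following holds:

\[ \|O - (\hat{W}^\top)^\dagger \hat{V} \hat{\Lambda}\| \le c\,\frac{\max(\sigma_1(O), 1)}{\pi_{\min}\min(\sigma_k(O)^2, 1)}\,\max(E_2, E_3) \]

for some constant $c > 0$.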


Proof.  By  Theorem  1,  the  following  hold  with  probability  at  least  1  —  77: 

ii'7  -  piif  =  m-v,r  <  Vy](64i5|)/(i/yi— )2  =  »e3, 

II A||  =  max  l/y/iTi  <  max(l/0i j  +  5 E3)  <  vr"2/2  +  5 E3. 
With the above two bounds and Lemma 2 we have


|| o  -  (w?T)tt/A||  <  ||o  -  nwo\\  +  \\nwo  -  (wTyv\\\ 

=\\uo  -  nwo\\  +  iKw^yA  -  (w?t)TvA|| 

<||n-nw||||0||  +  ||(L1At)TWA  —  (W"t)TVA||  +  H(lLt)T\/A-  (W?t)T\/A|| 

<||n  -  nw\\  +  ||w^||||v’||||A  -  a||  +  \\(w])tva  -  {w^)tva\\  +  ||(wt)TyA  -  (vb?t)TwA|| 
<||n  -  nH/||  +  \\w'\\e3  +  ||wt||||v  -  F||||A||  +  -  ^IIII^IIIIAH 


- — 7AA  +  3cti(0)£^3  +  3ai(0)\\V  -  V\\F 

Vk{0)2 

24  _\  _  /.  6(7!  (0)\ 


+^ffr-vi,  +  i)|A| 


&k{oy 


-c{(^k+3h{°)E3+(4+ 


e2 

)  °k{0) 


Tin 


<c 


27  cr\(0)  10  max(cr1(0),  1) 


\fri~i nir 


^min  VkiO)2 


max(L2,  E3) 


37max(cri(0),  1)  .  . 

<c— - v  v  ;  max(L2,  E3) 


^min  min((7fc(0)2, 1) 


where $c > 0$ is a constant large enough to dominate low-order terms like $E_2 E_3$.


□ 


Lemma 5. With a slight abuse of notation, let $U$ denote a column permutation of the true $U$, $UT$ denote a column permutation of the true $UT$, and $P$ denote a column-and-row permutation of the true $P$, where the permutations involved are the same. Suppose

\[ \max(\|\hat{U} - U\|, \|\widehat{UT} - UT\|) \le \sigma_k(rU + (1-r)UT)/2. \]

We then have

\[ \|P - (r\hat{U} + (1-r)\widehat{UT})^\dagger\,\widehat{UT}\| \le \frac{6\sigma_1(UT)}{\sigma_k(rU + (1-r)UT)^2}\,\max(\|\hat{U} - U\|, \|\widehat{UT} - UT\|). \]

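Proof. First notice that

\[
\begin{aligned}
(rU + (1-r)UT)^\dagger (UT)
&= \left((rI + (1-r)T)^\top U^\top U (rI + (1-r)T)\right)^{-1}(rI + (1-r)T)^\top U^\top U T \\
&= (rI + (1-r)T)^{-1} T = P.
\end{aligned}
\]

Then we have

\[
\begin{aligned}
\|P - (r\hat{U} + (1-r)\widehat{UT})^\dagger\,\widehat{UT}\|
&= \|(rU + (1-r)UT)^\dagger\,UT - (r\hat{U} + (1-r)\widehat{UT})^\dagger\,\widehat{UT}\| \\
&\le \|(rU + (1-r)UT)^\dagger\,UT - (r\hat{U} + (1-r)\widehat{UT})^\dagger\,UT\| \\
&\quad + \|(r\hat{U} + (1-r)\widehat{UT})^\dagger\,UT - (r\hat{U} + (1-r)\widehat{UT})^\dagger\,\widehat{UT}\| \\
&\le \|(rU + (1-r)UT)^\dagger - (r\hat{U} + (1-r)\widehat{UT})^\dagger\|\,\|UT\| \\
&\quad + \|(r\hat{U} + (1-r)\widehat{UT})^\dagger\|\,\|UT - \widehat{UT}\|. \qquad \text{(B.9)}
\end{aligned}
\]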
By Lemma 3 and the assumption of the lemma, we have

\[ \sigma_k(rU + (1-r)UT)/2 \le \sigma_k(r\hat{U} + (1-r)\widehat{UT}) \le 3\sigma_k(rU + (1-r)UT)/2, \]

showing that $\mathrm{rank}(r\hat{U} + (1-r)\widehat{UT}) = k$ and

\[ \|(r\hat{U} + (1-r)\widehat{UT})^\dagger\| = 1/\sigma_k(r\hat{U} + (1-r)\widehat{UT}) \le 2/\sigma_k(rU + (1-r)UT). \]

Because $\mathrm{rank}(r\hat{U} + (1-r)\widehat{UT}) = \mathrm{rank}(rU + (1-r)UT) = k$, Theorem 3.4 in Stewart [1977] indicates that

\[
\begin{aligned}
&\|(rU + (1-r)UT)^\dagger - (r\hat{U} + (1-r)\widehat{UT})^\dagger\| \\
&\quad\le \sqrt{2}\,\|(rU + (1-r)UT)^\dagger\|\,\|(r\hat{U} + (1-r)\widehat{UT})^\dagger\|\,\|r(U - \hat{U}) + (1-r)(UT - \widehat{UT})\| \\
&\quad\le \frac{\sqrt{2}\,(r\|U - \hat{U}\| + (1-r)\|UT - \widehat{UT}\|)}{\sigma_k(rU + (1-r)UT)\,\sigma_k(r\hat{U} + (1-r)\widehat{UT})}
\le \frac{2\sqrt{2}\,(r\|U - \hat{U}\| + (1-r)\|UT - \widehat{UT}\|)}{\sigma_k(rU + (1-r)UT)^2}.
\end{aligned}
\]


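Applying these bounds to (B.9) then leads to

\[
\begin{aligned}
&\|P - (r\hat{U} + (1-r)\widehat{UT})^\dagger\,\widehat{UT}\| \\
&\quad\le \frac{2\sqrt{2}\,\sigma_1(UT)\,(r\|U - \hat{U}\| + (1-r)\|UT - \widehat{UT}\|)}{\sigma_k(rU + (1-r)UT)^2} + \frac{2\|UT - \widehat{UT}\|}{\sigma_k(rU + (1-r)UT)} \\
&\quad= \frac{r\,2\sqrt{2}\,\sigma_1(UT)\,\|U - \hat{U}\|}{\sigma_k(rU + (1-r)UT)^2} + \frac{\left((1-r)\,2\sqrt{2}\,\sigma_1(UT) + 2\sigma_k(rU + (1-r)UT)\right)\|UT - \widehat{UT}\|}{\sigma_k(rU + (1-r)UT)^2} \\
&\quad\le \frac{(2\sqrt{2} + 2)\,\sigma_1(UT)}{\sigma_k(rU + (1-r)UT)^2}\,\max(\|\hat{U} - U\|, \|\widehat{UT} - UT\|) \\
&\quad\le \frac{6\sigma_1(UT)}{\sigma_k(rU + (1-r)UT)^2}\,\max(\|\hat{U} - U\|, \|\widehat{UT} - UT\|),
\end{aligned}
\]

in which we use the fact $\sigma_1(UT) \ge \sigma_1(rU + (1-r)UT) \ge \sigma_k(rU + (1-r)UT)$, so that the two coefficients sum to at most $(r\,2\sqrt{2} + (1-r)\,2\sqrt{2} + 2)\,\sigma_1(UT)$. □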
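The identity at the start of the proof of Lemma 5, $(rU + (1-r)UT)^\dagger (UT) = (rI + (1-r)T)^{-1}T$, can be verified exactly on a toy instance. The sketch below assumes $U$ has full column rank with $k = 2$, so the pseudo-inverse is $(A^\top A)^{-1}A^\top$; the matrices `U` and `T`, the value of `r`, and the helper names are hypothetical.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Product of matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(A):
    """Inverse of a 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def mix(r, A, B):
    """Entrywise r*A + (1 - r)*B."""
    return [[r * a + (1 - r) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(A, B)]

def pinv_full_column_rank(A):
    """Pseudo-inverse (A^T A)^{-1} A^T for an m x 2 matrix of rank 2."""
    At = transpose(A)
    return matmul(inv2(matmul(At, A)), At)

# Hypothetical 3x2 observation matrix U and column-stochastic T.
U = [[F(1), F(0)], [F(0), F(1)], [F(1), F(1)]]
T = [[F(2, 3), F(1, 4)], [F(1, 3), F(3, 4)]]
I2 = [[F(1), F(0)], [F(0), F(1)]]
r = F(1, 2)

UT = matmul(U, T)
lhs = matmul(pinv_full_column_rank(mix(r, U, UT)), UT)  # (rU+(1-r)UT)^† UT
rhs = matmul(inv2(mix(r, I2, T)), T)                    # (rI+(1-r)T)^{-1} T
```

Because the arithmetic is exact, `lhs == rhs` holds without any tolerance. Replacing `U` and `UT` by perturbed estimates is then what the remainder of the proof bounds.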


B.3.3  Concentration  of  empirical  averages 

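Lemma 6. Let $\{y_i\}_{i=1}^N$ be $N$ i.i.d. random vectors in $\mathbb{R}^m$. Let $\mu := E[y_i]$, $\Sigma := \mathrm{Var}(y_i)$ and $\sigma^2_{\max} := \max_d \Sigma_{dd}$. Let $\hat{\mu} := (\sum_i y_i)/N$. Then

\[ \mathrm{Prob}(\|\hat{\mu} - \mu\|_2 > \epsilon) \le \frac{m\,\sigma^2_{\max}}{N\epsilon^2}. \]

Proof. This lemma is a straightforward consequence of the Markov inequality:

\[
\begin{aligned}
\mathrm{Prob}(\|\hat{\mu} - \mu\|_2 > \epsilon) &= \mathrm{Prob}(\|\hat{\mu} - \mu\|_2^2 > \epsilon^2)
\le \frac{E[\|\hat{\mu} - \mu\|_2^2]}{\epsilon^2} \\
&= \frac{\sum_d E(\hat{\mu}_d - \mu_d)^2}{\epsilon^2} = \frac{\mathrm{Tr}(\Sigma)}{N\epsilon^2} \le \frac{m\,\sigma^2_{\max}}{N\epsilon^2}.
\end{aligned}
\]

□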
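To make the bound of Lemma 6 concrete, the sketch below evaluates $m\sigma^2_{\max}/(N\epsilon^2)$ and inverts it to obtain a sample size that pushes the bound below a target failure probability. The function names and the numbers in the example are illustrative assumptions.

```python
import math

def mean_deviation_bound(m, sigma2_max, n, eps):
    """Upper bound m * sigma2_max / (n * eps^2) on Prob(||mu_hat - mu||_2 > eps)."""
    return m * sigma2_max / (n * eps ** 2)

def samples_for(m, sigma2_max, eps, eta):
    """Smallest integer n (up to rounding) with bound(m, sigma2_max, n, eps) <= eta."""
    return math.ceil(m * sigma2_max / (eta * eps ** 2))

# Hypothetical setting: 10-dimensional vectors, per-coordinate variance at most 2.
bound = mean_deviation_bound(m=10, sigma2_max=2.0, n=1000, eps=0.5)
n_needed = samples_for(m=10, sigma2_max=2.0, eps=0.5, eta=0.1)
```

The bound decays like $1/N$ and grows linearly in the dimension $m$, which is the source of the polynomial factors in $m$ in Lemmas 7 and 8.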
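Lemma 7. Let $\hat{V}_1, \hat{V}_2, \hat{V}_3, \hat{C}_2, \hat{C}_3$ denote averages of $N$ independent draws of $x_1$, $x_1 \otimes x_1$, $x_1 \otimes x_1 \otimes x_1$, $x_1 \otimes x_2$, $x_1 \otimes x_2 \otimes x_3$ from the generative process in Section 5.2.2. Let $u_{\max} := \max_{i,j} |U_{ij}|$. Then

\[
\begin{aligned}
\mathrm{Prob}(\|\hat{V}_1 - V_1\|_2 > \epsilon) &\le \frac{m(u^2_{\max} + \sigma^2)}{N\epsilon^2}, \\
\mathrm{Prob}(\|\hat{V}_2 - V_2\|_F > \epsilon) &\le \frac{m^2(u^2_{\max} + \sigma^2)^2}{N\epsilon^2}, \\
\mathrm{Prob}(\|\hat{V}_3 - V_3\|_F > \epsilon) &\le \frac{m^3(u^2_{\max} + \sigma^2)^3}{N\epsilon^2}, \\
\mathrm{Prob}(\|\hat{C}_2 - C_2\|_F > \epsilon) &\le \frac{m^2(u^2_{\max} + \sigma^2)^2}{N\epsilon^2}, \\
\mathrm{Prob}(\|\hat{C}_3 - C_3\|_F > \epsilon) &\le \frac{m^3(u^2_{\max} + \sigma^2)^3}{N\epsilon^2}.
\end{aligned}
\]

Proof. Based on Lemma 6, it suffices to bound $\sigma^2_{\max}$ in these five cases: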


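\[
\begin{aligned}
\max_i \mathrm{Var}((x_1)_i) &\le \max_i E[(x_1)_i^2] = \max_i E_{h_1}[\sigma^2 + (U_{h_1})_i^2] \le \sigma^2 + \max_{i,k} U_{ik}^2, \\
\max_{i,j} \mathrm{Var}((x_1)_i (x_1)_j) &\le \max_{i,j} E[(x_1)_i^2 (x_1)_j^2] = \max_{i,j} E_{h_1}[(\sigma^2 + (U_{h_1})_i^2)(\sigma^2 + (U_{h_1})_j^2)] \\
&\le \max_{i,j,t} (\sigma^2 + U_{it}^2)(\sigma^2 + U_{jt}^2) \le (\sigma^2 + \max_{i,j} U_{ij}^2)^2, \\
\max_{i,j} \mathrm{Var}((x_1)_i (x_2)_j) &\le \max_{i,j} E[(x_1)_i^2 (x_2)_j^2] = \max_{i,j} E_{\pi_0}\left[E[(x_1)_i^2 \mid \pi_0]\,E[(x_2)_j^2 \mid \pi_0]\right] \\
&\le \max_{i,j} \sup_{\pi_0} E[(x_1)_i^2 \mid \pi_0]\,E[(x_2)_j^2 \mid \pi_0] \le \left(\max_i \sup_{\pi_0} E[(x_1)_i^2 \mid \pi_0]\right)^2 \\
&= \left(\max_i \sup_{\pi_0} \sum_k U_{ik}^2 (T\pi_0)_k + \sigma^2\right)^2 \\
&= \left(\max_i \max_j \sum_k U_{ik}^2 T_{kj} + \sigma^2\right)^2 \le (\max_{i,j} U_{ij}^2 + \sigma^2)^2.
\end{aligned}
\]

With similar arguments, we have that

\[
\begin{aligned}
\max_{i,j,l} \mathrm{Var}((x_1)_i (x_1)_j (x_1)_l) &\le (\max_{i,j} U_{ij}^2 + \sigma^2)^3, \\
\max_{i,j,l} \mathrm{Var}((x_1)_i (x_2)_j (x_3)_l) &\le (\max_{i,j} U_{ij}^2 + \sigma^2)^3.
\end{aligned}
\]

□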


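Lemma 8. Let $\hat{M}_2, \hat{M}_3, \hat{M}_2', \hat{M}_3'$ denote estimates of the population quantities defined in Theorem 4 obtained by plugging in empirical averages of independent samples as in Lemma 7, and $\hat{\sigma}^2 := \lambda_{\min}(\hat{V}_2 - \hat{V}_1\hat{V}_1^\top)$, where $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue in modulus. Define $\nu := \max(\sigma^2 + u^2_{\max}, 1)$. We then have the following:

\[
\begin{aligned}
\mathrm{Prob}(\|\hat{M}_2' - M_2'\| \ge \epsilon) &\le \frac{75\,m^2\nu^2}{N\epsilon^2}, \\
\mathrm{Prob}(\|\hat{M}_3' - M_3'\| \ge \epsilon) &\le \frac{1000\,m^4\nu^3}{N\epsilon^2}, \\
\mathrm{Prob}(\|\hat{M}_2 - M_2\| \ge \epsilon) &\le \frac{50\,(\alpha_0 + 1)^2 m^2\nu^2}{N\epsilon^2}, \\
\mathrm{Prob}(\|\hat{M}_3 - M_3\| \ge \epsilon) &\le \frac{1100\,k^2 m^3 (\alpha_0 + 2)^2 (\alpha_0 + 1)^2 \nu^3}{N\epsilon^2}.
\end{aligned}
\]

Proof. We first note that it is easy to verify $z^\top(\hat{V}_2 - \hat{V}_1\hat{V}_1^\top)z \ge 0$ for any real vector $z$, so $\hat{\sigma}^2$ is always non-negative. By Lemma 3, we have

\[
\begin{aligned}
|\hat{\sigma}^2 - \sigma^2| &\le \|\hat{V}_2 - \hat{V}_1\hat{V}_1^\top - (V_2 - V_1 V_1^\top)\| \le \|\hat{V}_2 - V_2\| + \|\hat{V}_1\hat{V}_1^\top - V_1 V_1^\top\| \\
&\le \|\hat{V}_2 - V_2\| + \|\hat{V}_1 - V_1\|(\|\hat{V}_1\| + \|V_1\|) \\
&\le \|\hat{V}_2 - V_2\| + 2\|V_1\|\,\|\hat{V}_1 - V_1\| + \|\hat{V}_1 - V_1\|^2.
\end{aligned}
\]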


We also need the following:

\[ \|V_1\|^2 \le \sum_{i,j} \pi_j U_{ij}^2 \le \sum_i \max_j U_{ij}^2 \le m\,u^2_{\max}. \]


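Then we have

\[
\begin{aligned}
\|\hat{M}_2' - M_2'\| &\le \|\hat{V}_2 - V_2\| + |\hat{\sigma}^2 - \sigma^2| \\
&\le 2\|\hat{V}_2 - V_2\| + 2\|V_1\|\,\|\hat{V}_1 - V_1\| + \|\hat{V}_1 - V_1\|^2 \\
&\le 2\|\hat{V}_2 - V_2\|_F + 2\|V_1\|\,\|\hat{V}_1 - V_1\| + \|\hat{V}_1 - V_1\|^2,
\end{aligned}
\]

which implies

\[
\begin{aligned}
&\mathrm{Prob}(\|\hat{M}_2' - M_2'\| > \epsilon) \\
&\quad\le \mathrm{Prob}(2\|\hat{V}_2 - V_2\|_F + 2\|V_1\|\,\|\hat{V}_1 - V_1\| + \|\hat{V}_1 - V_1\|^2 > \epsilon) \\
&\quad\le \mathrm{Prob}(2\|\hat{V}_2 - V_2\|_F > \epsilon/3) + \mathrm{Prob}(2\|V_1\|\,\|\hat{V}_1 - V_1\| > \epsilon/3) + \mathrm{Prob}(\|\hat{V}_1 - V_1\|^2 > \epsilon/3) \\
&\quad\le \frac{36\,m^2(u^2_{\max} + \sigma^2)^2}{N\epsilon^2} + \frac{36\,\|V_1\|^2 m(u^2_{\max} + \sigma^2)}{N\epsilon^2} + \frac{3m(u^2_{\max} + \sigma^2)}{N\epsilon} \\
&\quad\le \frac{36\,m^2(u^2_{\max} + \sigma^2)^2}{N\epsilon^2} + \frac{36\,m^2 u^2_{\max}(u^2_{\max} + \sigma^2)}{N\epsilon^2} + \frac{3m(u^2_{\max} + \sigma^2)}{N\epsilon} \\
&\quad\le \frac{75\,m^2(u^2_{\max} + \sigma^2)^2}{N\epsilon^2}.
\end{aligned}
\]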
Similarly, we have


||M'-M'||  <  ||l/3-f3||F  +  3||V1®1((T2/)-f1®1(a2/)||F 

=  WVs-Vsy  +  Sy/^W^-a^W 

<  \\V3  -  V3\\F  +  3y/m(cr'2\\Vi  -  t^H  +  | cr2  -  a2|(||l4||  +  ||f  -  1411)) 

<  || V3  -  f||F  +  ||^i  -  14||3v44(a2  +  2 m<ax)  +  \\V2  -  f>||3?wm 

+||K  -  f||29 Umaxm  +  3fm(||Ui  -  f||||f2  -  u2||  +  ||K  -  t4||3), 


implying 


ProbdlMg  —  M3 1|  >  e) 

<  Prob(|| V3  -  U3||f  >  e/6)  +  Prob(||^  -  f||  >  e/(18 2  +  2 m<ax))) 
+Prob(||U2  -  f>||  >  e/(18 umaxm))  +  Prob(||14  -  V4 1|2  >  e/(54 Mmaxm)) 


+Prob 


+  Prob 


vjii  > 


nn 


+Prob(|| Vi  -  t4||3  >  e/(18y/m}) 

36  m3  «ax  +  a2)3  324m2  {a2  +  2mu2mJ2{a 2  +  <^)  324<axm4(q2  +  <ax)2 

Ne2  Ne2  Ne2 

54wmaxm2(o'2  +  <ax)  18  m3/2  (a2  +  <^)  18  m5/2  (a2  +  <ax)2 

Ne  Ne  Ne 

361/3m4/3(o'2  +  <ax) 

IVe2/3 

1000m4 (max(cr2  +  rx/1RV,  l))3 

iv/2  ‘ 


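Using similar arguments, we have

\[
\begin{aligned}
\|\hat{M}_2 - M_2\| &\le (\alpha_0 + 1)\|\hat{C}_2 - C_2\|_F + \alpha_0\|\hat{V}_1\hat{V}_1^\top - V_1 V_1^\top\|_F \\
&\le (\alpha_0 + 1)\|\hat{C}_2 - C_2\|_F + 2\alpha_0\|V_1\|\,\|\hat{V}_1 - V_1\| + \alpha_0\|\hat{V}_1 - V_1\|^2,
\end{aligned}
\]

and therefore

\[
\begin{aligned}
&\mathrm{Prob}(\|\hat{M}_2 - M_2\| > \epsilon) \\
&\quad\le \mathrm{Prob}\left(\|\hat{C}_2 - C_2\|_F > \frac{\epsilon}{3(\alpha_0 + 1)}\right) + \mathrm{Prob}\left(\|\hat{V}_1 - V_1\| > \frac{\epsilon}{6\alpha_0\|V_1\|}\right) + \mathrm{Prob}\left(\|\hat{V}_1 - V_1\|^2 > \frac{\epsilon}{3\alpha_0}\right) \\
&\quad\le \frac{9(\alpha_0 + 1)^2 m^2(\sigma^2 + u^2_{\max})^2}{N\epsilon^2} + \frac{36\,\alpha_0^2 m^2 u^2_{\max}(\sigma^2 + u^2_{\max})}{N\epsilon^2} + \frac{3\alpha_0 m(\sigma^2 + u^2_{\max})}{N\epsilon} \\
&\quad\le \frac{50(\alpha_0 + 1)^2 m^2(\sigma^2 + u^2_{\max})^2}{N\epsilon^2}.
\end{aligned}
\]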
Finally, we have


II M,  -  M3|| 

<  (a'°  +  2)2(a°  +  11  l|C3  -  C3\\F  +  3(a'°  +  1)a°  UVa  0!  C2  -  Vi  ®  C2\\F 

+al\\Vi  ®  Vi  ®  Vi  -  f  ®  f  ®  fi|| F 

<  (°° +  2)(o° + 11  lies  -  Qiif  +  3(a°t1)a°IIVi  -  Vi||||C2||F  +  3(a°!1)a°||Vi||||C2  -  C,\\f 


2  M'-'  1 1  2 

11^1  -  fill  +  3ckq || Vi || || Vi  -  fill2  +  al\\V,  -  fill3 


+3q;q  ||  Vi  || 2 

(«o  "F  2)  (cto  F  1)  ||  ^  ~pF 1|  3(a'o  "F  1)«o  h  t7  n  n  n  i  "F  l)cio  p  n  n n 

F  - o - ll°3  -  L3||f  H - o - II  Vl  ~  Fill  II  C'2  II  ^  H - o - II  Fill  ||<-'2  -  F2||f 


3(q;o  +  l)a0 


||F  -  F||  ||C2  -  ^1^  +  3«q || || 2 1| VI  -  Fi||  +  3^||Fi||||Fi  -  F||2  +  a2||F  -  F||: 


< 


(cio  +  2)(a0  +  1) 


IIC3  —  C3||f  +  5(ao  +  l)<xokmu‘^nay.\\Vi  —  FJ|  + 


3(a0  +  l)ap 


||Fi||||C2- C2||F 


3(q;o  +  l)cio 


||Fi  -  Fi||  ||C2  -  Ca||f  +  3a2||F||  ||F  -  F||2  +  a2||F  -  F||: 


using  the  fact  that 

||C2||f  = 

and  thus 


UT 


diag7r  +  a07r7r 

Ot  0  +  1 


T 


TtUt 


<  \\UT\\l  <  kmu 


2 

max’ 


Prob(||M3  —  M3 1|  >  e)  <  Prob  (  ||C3  —  Cs\\f  > 


+  Prob  ||Fi-Fi||f  > 


30(ao  +  l)oLokmu^ 


+  Prob  (  ||Fi-F||2> 


ISagllUII 


+  Prob  IIH  -  VI ||  ‘  > 


3(q!o  +  2)(ao  +  1) 

+  Prob  ^ 1 1 C2  —  C2||f  > 
e 


9(cio  +  l)ao 


6ao 


<  9  m2  (ftp  +  2)2(a0  +  l)2(a2  +  u2^x)3  +  900 k2m3(a0  +  1  )2ag^iax(a2  +  u 


2  ) 

max/ 


Ne 2 


Ne2 


81(a0  +  l)2apm2(o-2  +  ax)2  +  18  alm3/2umax(a2  +  M2iax)  +  6maQ/3(a2  +  w2,. 


< 


AF2  '  Ne 

1100  k2m3(a0  +  2)2(a0  +  l)2(tr2  +  li^)3 
AF2 


Ne2/3 


□ 


B.4  Proof  of  Theorem  5 

Let $\hat{U}$ and $\widehat{UT}$ be column-permuted as described in Algorithm 5.4. Let

\[ \delta_{\min} := \min_{i \ne j} \left|1/\sqrt{\pi_i} - 1/\sqrt{\pi_j}\right|. \]

If $\max(E_3, E_3') \le \delta_{\min}/15$, Theorem 5.1 of Anandkumar et al. [2012a] implies that for any $\eta \in (0,1)$, with probability at least $1 - \eta$, the columns of $\hat{U}$ and $\widehat{UT}$ are matched to the same permutation of the columns of the true $U$ and $UT$, respectively. As in Lemma 5, let $U$, $UT$, and $P$ denote proper permutations of the true matrices. We then have


Prob  (  max(||[7  —  U\\,  \\UT  —  UT\\)  > 


ecrk(rU  +  (1  -  r)UTf 


6  ax{UT) 


<Prob  [\\U-U\\> 


eak(rU  +  (1  —  r)UT) 
6<j  \(UT) 


Let  the  failure  probability  for  the  tensor  decomposition  method  be  set  to  rj.  Then  by  Lemma  4 
we  can  bound  the  first  term  as  follows: 


Prob 

<Prob 


\\u-u ||  > 


eak(rU  +  (1  —  r)UT)' 


max(£,2,  E's)  > 


6(Ti  (UT) 

eak(rU  +  (1  -  r)UT)2n^n  min((7fc(f7)2, 1) 
6(7i(f/T)cmax((71(f/),  1) 


+  Prob  (max  (if 2,  E3 )  >  ak(M2)/2)  +  |  +  Prob(Eg  >  Amin/15) , 


where the first term on the r.h.s. is based on Lemma 4 conditioned on the event that $\max(E_2, E_3) \le \sigma_k(M_2)/2$ and the tensor decomposition method succeeds, the second and the third terms bound the probability that the event does not occur, and the last term bounds the probability of incorrectly matching the columns of $\hat{U}$ and $\widehat{UT}$. To continue bounding these terms we use Lemma 8 to have


Prob  max (E2,  E3)  > 


ecrk(rU  +  (1  -  r)UT)2n^nmm(crk(U)2, 1) 


< 


< 


6(71(f/T)cmax((71(f/),  1) 

(2700 m2u2  +  36000m4 is3)cri(UT)2c2  max(cr1(t/)2, 1) 
Ne2ak(rU  +  (1  -  r)UT)4nTnmm(ak(U)4, 1) 

39000m4 z/3(7i(f/T)2c2  max(o~ i{U)2, 1) 

Ne2crk(rU  +  (1  -  r)UT)4 7T^in  min((7fe(t/)4, 1)  ’ 

300 m2u2  +  4000m4z/3 


4,, 3 


Prob(msix(E2,  E3)  >  ak(M2)/2)  < 

225000 m4z/3 


Nak(M2 )2 


< 


4300 mV 
iVafc(M2)2’ 


Prob(£;' >  5min/15)  < 


NSL 


Thus,  by  setting  the  sample  size  N  so  that 

Ar  12  m4  z/3  f  225000  4300 

N  >  - max 


39000(7i (UT)2c2  max((7i(f/)2, 1) 


V 


we  have 


V  5min  ’  °k(M2)2  ’  e2crk(rU  +  (1  -  r)UT)4nTn  mm(crk(U)4, 1)  J  ’ 
eak(rU  +  (1  —  r)UT)2 


Prob  \\U-U\\  > 


6(7i  (UT) 


<  V~, 

~  2’ 


(B.10) 
where the randomness is from both the data and the algorithm. Using similar arguments, we have
that for sample size N such that


12  k2m3(a0  +  2)2(a0  +  l)2v3 
V 

/ 225000  4600  42000^  (UT)2{c'f  max((r1(C/T)2, 1)  \ 

maX  V  ^min  ’  °k(M' 2)2  ’  e2crk(rU  +  (1  -  r)f/T)47r^in  mm(ak(UT)A, 1) )  ’ 


the  following  holds: 


Prob  ||£/T-  f/T||  > 


eak(rU  +  (1  —  r)UT)' 
6<Ji(UT) 


<  71. 


( B  .11) 


Combining the two bounds (B.10) and (B.11), we have for

>  12  rna x(k2,  m)rn3v3(a0  +  2)2(a0  +  l)2 

V 

/ 225000  4600  42000c2 cy (UT)2  max(cr1(C/T),  <Ji(U),  l)2  \ 

maX  V  5min  ’  min(crfe(M/2),  crfe(M2))2  ’  e2ak(rU  +  (1  -  r)UT)A  mm(crk(UT),  ak(U),  1  )4 )  ’ 


the following bound holds for any $\epsilon > 0$ and $\eta \in (0,1)$:


Prob 


fmax(||£7  -  £7||,  ||C/T  -  £7T||)  < 


V 


6<Ti(C/T) 


>  1 


V, 


which by Lemma 5 implies that

\[ \mathrm{Prob}\left(\|P - (r\hat{U} + (1-r)\widehat{UT})^\dagger\,\widehat{UT}\| \le \epsilon\right) \ge 1 - \eta. \]
Appendix C

Derivations in Chapter 6

C.1 Derivation of (6.20)


Using  properties  of  the  matrix  trace  and  the  kernel  trick,  we  immediately  have 

\\\ Z2PZJ  -  C2>1||^(.  tx  iTr( PTM2PM,)  -  Tr(PTF), 

~ —\\i  +  \\zipT  1  ~  —III 

2  V  ms  ms 

oc  -1  r(PTM2P  +  PMiPt)1  -  ulT(PTn2  +  P^). 

Let  A,,  ( • )  denotes  the  / - 1 h  Eigenvalue  of  a  matrix.  We  then  rewrite  the  nuclear  norm  term: 


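\[
\begin{aligned}
\tau\|Z_2 P Z_1^\top\|_* &= \tau\sum_i \sqrt{\lambda_i(Z_2 P L_1 L_1^\top P^\top Z_2^\top)} \\
&= \tau\sum_i \sqrt{\lambda_i(L_1^\top P^\top L_2 L_2^\top P L_1)} = \tau\|L_2^\top P L_1\|_*,
\end{aligned}
\]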

C.2  Derivation  of  (6.34) 

We  begin  by  defining  some  notations: 

H  :=  UTM3U,  R  :=  UTM,U,  u  :=  UT 1,  v  :=  VT1, 

T  ~  T  ~ 

Pi  :=  QjZtf,  F-2  :=  F3  :=  $JZ3U. 

n 

Let  vec(X)  be  the  vector  resulting  from  column  concatenation  of  a  matrix  X,  diag(x)  be  the 
diagonal  matrix  with  the  vector  x  being  its  main  diagonal.  Superscripts  denote  column  indices. 
Using  properties  of  the  matrix  trace  and  the  kernel  trick,  we  re-write  the  three  terms  in  (6.34)  as 
follows.  For  the  first  term  we  have 

11^3,1,2(1-8;})  -  C3ii,2 1| g®g®g 

oc  Tr(  Y,(Z2)d(Zl2)dVB?  UtM3UBvVtM i)  - 

d  l,V 

2  E  Tr  (  E  VBjUT(Zl2)dZj 

d  l 

=Tr  (  ]T(M2  VAt HBvR  -2  FJ diag (F*)Fi) , 
w  i 

and  then  for  the  second  term 

WCsMiBt})  -C2,i\\l^g  oc 

Tr( [Biv  ■■■  Bmv]T  H  [By  Bmv]M2)- 

2Tr([JB1v  ■■■  Bmv\TUTM32PM12)  = 

Tr (  ^2(M2)aBj HByvT  -  2  ^  Bj UtM32PM(2vt)  , 

il  i 




and  finally  for  the  third  term 


I|c,1i2(W)t-c2>1||J^<x 

Tr([Bju  BTmu]M2[Bju~-  BTmu}TR)~ 

2Tr( [Bj u  •••  B^u\ M2P  My V )  = 

Tr(  J2(M2)ijBT uutBjR  -  2  Bj u(M’)TM1f) . 

i 

To further simplify these expressions, we re-define the notation $B$ to be a $k^2$-by-$m$ matrix whose $l$-th column $B^l$ denotes column concatenation of the $k$-by-$k$ matrix $B_l$ in the above expressions. With the new notation and the identity

\[ \mathrm{vec}(XYZ) = (Z^\top \otimes X)\,\mathrm{vec}(Y) \qquad \text{(C.1)} \]

where $\otimes$ denotes the Kronecker product, we obtain the succinct form (6.34) in which

C  :=  R  o  H  +  m((vvt)  o  H  +  R  o  (uuT)), 

J  ■=  (Fy  o  F3)T[vec(diag(F21))  ■  ■  •  vec(diag(F™))] 

+  m((v  O  (UtM32P))M12  +  ((yTM!PT)  O  u )M2y 
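Identity (C.1) is the step that turns the trace expressions into a quadratic form in $\mathrm{vec}(B_l)$, and it is easy to confirm on small integer matrices. In the sketch below, `vec` stacks columns as in the definition above; the matrices `X`, `Y`, `Z` and the helper names are arbitrary illustrations.

```python
def matmul(A, B):
    """Product of matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def kron(A, B):
    """Kronecker product: block (i, j) of the result is A[i][j] * B."""
    return [[A[i][j] * B[p][q]
             for j in range(len(A[0])) for q in range(len(B[0]))]
            for i in range(len(A)) for p in range(len(B))]

def vec(X):
    """Column concatenation (column-major stacking) of a matrix."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# Arbitrary test matrices: X is 2x3, Y is 3x2, Z is 2x2.
X = [[1, 2, 0], [-1, 3, 4]]
Y = [[2, 1], [0, -3], [5, 2]]
Z = [[1, -2], [3, 4]]

lhs = vec(matmul(matmul(X, Y), Z))          # vec(XYZ)
rhs = matvec(kron(transpose(Z), X), vec(Y))  # (Z^T ⊗ X) vec(Y)
```

Both sides are length-4 integer vectors and agree exactly. Note that the identity relies on column-major stacking, so a row-major `vec` would require swapping the Kronecker factors.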